Malayalam Subword Tokenizer

Let’s start with the obvious question: what is a tokenizer? A tokenizer in Natural Language Processing (NLP) is a text-preprocessing step that splits text into tokens. Tokens can be sentences, words, or any other units that make up a text.

Every NLP package has a word tokenizer built in. However, Malayalam tokenization poses a challenge of its own, as the example below suggests.
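As a rough sketch of what a word tokenizer does (the function name and example sentence are my own illustration, not part of the tokenizer described in this post), a simple whitespace-based tokenizer in Python might look like this:

```python
import re

def naive_word_tokenize(text):
    # Split on runs of non-whitespace characters; a deliberately
    # simple stand-in for the word tokenizers found in NLP packages.
    return re.findall(r"\S+", text)

# Hypothetical example sentence, roughly "Malayalam is a language."
sentence = "മലയാളം ഒരു ഭാഷയാണ്"
print(naive_word_tokenize(sentence))
# -> ['മലയാളം', 'ഒരു', 'ഭാഷയാണ്']
```

Note how the last token, ഭാഷയാണ്, fuses the noun ഭാഷ ("language") with the copula ആണ് ("is"). Because Malayalam is highly agglutinative, whitespace splitting leaves such fused forms intact, which hints at why a subword tokenizer becomes interesting.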
