Imagine that you are working on machine translation or a similar Natural Language Processing (NLP) problem. Can you process the corpus as a whole? No. You will have to break it into sentences first and then into words. This process of splitting an input corpus into smaller subunits is known as tokenization, and the resulting units are tokens. For instance, when paragraphs are split into sentences, each sentence is a token. This is a fairly straightforward process in English but not so in Malayalam (and some other Indic languages).

Here is a sample paragraph from Wikipedia:

In computer science, artificial intelligence is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem-solving".

And here is the tokenized list of sentences:

  1. In computer science, artificial intelligence is intelligence demonstrated by machines, in contrast to the natural intelligence displayed by humans. 
  2. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. 
  3. Colloquially, the term "artificial intelligence" is often used to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem-solving".

We got these results because there is a punctuation mark, the period, at the end of each sentence. The end punctuation can also be a question mark, an exclamation mark, etc. Some abbreviations carry a period as well (for example, "U.S." or "Prof. John"), and a question mark or exclamation mark can appear inside a quote (for example, “Did you buy it? I thought you wouldn't do it.” James commented.). NLP libraries are capable of handling these exceptions in English.
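Here is a minimal sketch of how that looks in practice, assuming NLTK and its pretrained English Punkt model are installed:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")  # the pretrained English Punkt model that ships with NLTK

text = ('The U.S. office was opened by Prof. John in 2019. '
        '"Did you buy it? I thought you wouldn\'t do it." James commented.')

# sent_tokenize uses the English Punkt model by default; it is aware of
# common abbreviations and of punctuation that does not end a sentence.
for sentence in sent_tokenize(text):
    print(sentence)
```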

How NLP Solutions Handle Tokenization

Most of the popular Natural Language Processing (NLP) libraries have their own sentence tokenizers. Some of them take a rule-based approach while others take a neural network-based approach. The former identify exceptions, such as abbreviations and quoted text, using handcrafted rules. In a neural network-based approach, the model can be a simple Recurrent Neural Network (RNN), a Long Short-Term Memory (LSTM) network, etc. Packages such as NLTK and spaCy take a rule-based/algorithmic approach. FlairNLP doesn’t have its own tokenizer but integrates the segtok package, which is a rule-based tokenizer. StanfordNLP uses an LSTM-based tokenizer in its neural pipeline.
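For comparison, here is roughly what the rule-based/algorithmic approaches look like in code. This is only a sketch; it assumes spaCy's small English model (en_core_web_sm) and the segtok package are installed:

```python
import spacy
from segtok.segmenter import split_single

text = ('In computer science, artificial intelligence is intelligence '
        'demonstrated by machines. Leading AI textbooks define the field '
        'as the study of "intelligent agents".')

# spaCy: sentence boundaries come from its statistical pipeline.
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([sent.text for sent in doc.sents])

# segtok (integrated into flairNLP): a purely rule-based splitter.
print(list(split_single(text)))
```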

Malayalam Sentence Tokenization Problem

The above-mentioned packages perform well on English paragraphs, but they are not up to the mark when it comes to Malayalam. None of these packages has built-in support for Indic languages, particularly Malayalam.

Unlike English, Indic languages have many special conditions:

  • Malayalam abbreviations are typically longer than English ones. For example: B.Sc is a three-character abbreviation for Bachelor of Science, but the same abbreviation in Malayalam (ബി.എസ് സി) has seven characters.
  • Some Indic languages have their own punctuation to end a sentence. For example: Hindi has the पूर्ण विराम (।).
  • Older Malayalam texts may not carry any end punctuation at all. 

Since there are many such exceptions, the rule/algorithm-based sentence splitters in popular NLP libraries will not work for Malayalam out of the box.
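To see the problem concretely, consider running NLTK's pretrained English Punkt model on Malayalam text that contains an abbreviation. The sample sentences below are my own illustration; since the English model has never seen Malayalam abbreviations, it is likely to break the text at the periods inside ബി.എസ് സി:

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt")

# Two Malayalam sentences; the first contains the abbreviation for B.Sc.
text = "അവൾ ബി.എസ് സി. ബിരുദം നേടി. അതിനു ശേഷം അവൾ ജോലിക്ക് ചേർന്നു."

# The English Punkt model knows nothing about Malayalam abbreviations, so the
# periods inside the abbreviation will typically be treated as sentence ends.
print(sent_tokenize(text))
```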

The Search for a Tokenization Solution

I began exploring the possibility of adding Malayalam sentence segmentation support to NLP libraries. It was a multi-step process, described below.

Data Collection and Preparation

It takes a lot of data to improve any NLP package. Having decided to use Wikipedia as my primary data source, I started collecting Wikipedia data dumps. In addition to the wiki dataset, I collected Malayalam news articles and blog posts (data extraction for NLP has been covered in a previous article). The next step was cleaning the dataset and getting rid of unwanted parts like English sentences, URLs, etc. I prepared a script using regex to run the cleaning process.
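The actual script is not reproduced here, but the sketch below captures the idea; the patterns and file names are illustrative rather than the exact ones I used:

```python
import re

# Illustrative patterns: URLs, e-mail addresses, and runs of Latin characters
# (English fragments) that should not reach the tokenizer training step.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
LATIN_RE = re.compile(r"[A-Za-z][A-Za-z'\-]*")
MULTISPACE_RE = re.compile(r"\s+")

def clean_line(line: str) -> str:
    """Strip URLs, e-mail addresses and English fragments from one line."""
    line = URL_RE.sub(" ", line)
    line = EMAIL_RE.sub(" ", line)
    line = LATIN_RE.sub(" ", line)
    return MULTISPACE_RE.sub(" ", line).strip()

with open("wiki_dump.txt", encoding="utf-8") as src, \
        open("wiki_dump_clean.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = clean_line(line)
        if cleaned:                      # drop lines that became empty
            dst.write(cleaned + "\n")
```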

Evaluation Metrics

To evaluate the models, I chose precision and recall as the metrics. I picked a few articles from the Malayalam Wikipedia dataset to build a standard test set and tokenized them manually to generate proper ground-truth data for the evaluation process. To generate benchmark values, I evaluated the default tokenizer from the commonly used NLP package NLTK. On my standard test set, the NLTK sentence tokenizer had a precision of 0.90 and a recall of 0.95.
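Precision and recall here are computed over predicted sentence boundaries against the manually tokenized ground truth. A minimal sketch of the computation (the boundary representation is my own simplification and assumes the sentences concatenate back to the same text):

```python
def boundary_offsets(sentences):
    """Return the character offsets at which each sentence ends."""
    offsets, position = set(), 0
    for sentence in sentences:
        position += len(sentence)
        offsets.add(position)
    return offsets

def precision_recall(predicted_sentences, gold_sentences):
    """Compare predicted sentence boundaries against the ground truth."""
    predicted = boundary_offsets(predicted_sentences)
    gold = boundary_offsets(gold_sentences)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```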

Implementation

1. Natural Language Toolkit (NLTK)

NLTK makes use of PunktSentenceTokenizer, which is implemented as an unsupervised algorithm. This algorithm learns a list of abbreviations, collocations, and words used at the beginning of sentences. The Punkt sentence tokenizer performs the sentence segmentation in two stages. In the first stage, it performs type-based classification, annotating tokens as abbreviations, ellipses, or non-abbreviation sentence boundaries. The second stage is token-based classification, where it applies additional heuristics to the initial annotations to refine the result; this is where it identifies the initials and ordinal numbers missed in the first stage. Since this training process is unsupervised, it was easy to train the system with the dataset I had prepared. After a few hours of training, the Punkt tokenizer model was ready to serve. Evaluation with the previously created test set showed a precision of 0.914 and a recall of 0.954. Further fine-tuning didn’t have an effect on these values.
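For reference, here is a sketch of the unsupervised training step, assuming the cleaned corpus sits in a single plain-text file (the file names are placeholders):

```python
import pickle
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

with open("malayalam_corpus.txt", encoding="utf-8") as f:
    corpus = f.read()

trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True   # learn collocations as well as abbreviations
trainer.train(corpus)

# Wrap the learned parameters in a tokenizer and persist it for later use.
tokenizer = PunktSentenceTokenizer(trainer.get_params())
with open("malayalam_punkt.pickle", "wb") as f:
    pickle.dump(tokenizer, f)

sentences = tokenizer.tokenize(corpus[:1000])   # quick sanity check on a slice
```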

2. StanfordNLP

In StanfordNLP, the tokenizer is part of a neural pipeline and requires supervised training. The tokenizer is implemented with a bi-directional LSTM. To train the StanfordNLP tokenizer, the training data should be in CoNLL-U format. I used the trained NLTK Punkt model, followed by a verification process, to generate the CoNLL-U dataset. Once the dataset was ready, I trained the model for 20K steps, but the resulting model's performance was poor. I tried fine-tuning the hyper-parameters but saw no improvement.
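The conversion to CoNLL-U is not reproduced here; a simplified sketch, which fills only the ID and FORM columns and assumes whitespace word tokenization, looks like this:

```python
import pickle

def sentences_to_conllu(sentences):
    """Yield minimal CoNLL-U records: only ID and FORM filled, the rest '_'."""
    for sentence in sentences:
        yield "# text = " + sentence
        for index, token in enumerate(sentence.split(), start=1):
            yield "\t".join([str(index), token] + ["_"] * 8)   # 10 columns in total
        yield ""   # a blank line closes each sentence block

# Reuse the Punkt model trained above (file names are placeholders).
with open("malayalam_punkt.pickle", "rb") as f:
    punkt = pickle.load(f)

with open("malayalam_corpus.txt", encoding="utf-8") as src, \
        open("ml_train.conllu", "w", encoding="utf-8") as dst:
    sentences = punkt.tokenize(src.read())
    dst.write("\n".join(sentences_to_conllu(sentences)) + "\n")
```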

3. FlairNLP

Next up was flairNLP, another popular NLP library. Flair doesn’t have a built-in tokenizer; instead, it integrates segtok, a rule-based tokenizer. Since flairNLP supports language models, I decided to build a language model for Malayalam first, which would help me build a better sentence tokenizer. I used the test data stream as input to the new model and split the stream into sentences wherever the model predicted the end of a sequence. The precision of this approach was very high, close to 1, but recall was poor: less than 0.05.
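The language-model training followed Flair's standard recipe. The sketch below shows the general shape; the corpus directory, output path, and hyper-parameters are placeholders, and the corpus folder must contain the train/valid/test split that Flair expects:

```python
from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# A character-level corpus and a forward language model.
dictionary = Dictionary.load("chars")
corpus = TextCorpus("malayalam_corpus/", dictionary,
                    forward=True, character_level=True)

language_model = LanguageModel(dictionary, is_forward_lm=True,
                               hidden_size=1024, nlayers=1)

trainer = LanguageModelTrainer(language_model, corpus)
trainer.train("resources/ml_lm", sequence_length=250,
              mini_batch_size=100, max_epochs=10)
```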

4. Custom Rule-Based Tokenizer

Compared to the neural-based packages, the rule/algorithm-based NLTK Punkt tokenizer fared best, even though all of them were trained on the same number of input samples. This made me think of implementing a custom rule-based sentence tokenizer module, which would allow people to flexibly add more rules for Malayalam tokenization. I generated a few rules by analyzing the large corpus used to train the other models and converted those rules into code. To handle exceptions, I generated a list of abbreviations with the support of the NLTK training process and put the tokenizer to the test. The custom rule-based tokenizer showed the best results, with both precision and recall of 1.
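The actual module is linked in the conclusion below; the sketch here only illustrates the core idea of protecting known abbreviations before splitting on end punctuation (the abbreviation list and placeholder character are illustrative):

```python
import re

# A tiny illustrative subset of abbreviations harvested from the corpus.
ABBREVIATIONS = ["ഡോ.", "പ്രൊഫ.", "ബി.എസ് സി."]

SENTENCE_END_RE = re.compile(r"(?<=[.?!])\s+")
PLACEHOLDER = "\u2980"   # an unlikely character used as a temporary stand-in

def tokenize_sentences(text: str) -> list:
    """Split Malayalam text into sentences while keeping abbreviations intact."""
    # 1. Protect the periods inside known abbreviations.
    for abbreviation in ABBREVIATIONS:
        text = text.replace(abbreviation, abbreviation.replace(".", PLACEHOLDER))
    # 2. Split on sentence-ending punctuation followed by whitespace.
    sentences = SENTENCE_END_RE.split(text.strip())
    # 3. Restore the protected periods.
    return [s.replace(PLACEHOLDER, ".") for s in sentences if s]
```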

Conclusion

An important finding from my investigation into Malayalam sentence tokenization with NLP packages is that rule/algorithmic solutions perform better than neural pipelines. The neural pipeline approach may do better once an improved language model for Malayalam is available, which we don't have yet.

This is not the end of the research, of course. A larger dataset is required for further investigation. My team has processed Common Crawl data for this purpose, and it is publicly available. My rule-based tokenizer code is also public. I have raised a pull request to the NLTK Data repository to make the Malayalam Punkt tokenizer officially a part of the NLTK library. Until NLTK accepts the request, you can download the model from the pull request itself. If you are interested in Malayalam tokenization, do check our previous article on the Malayalam Subword Tokenizer.