PyTorch Lightning: A Better Way to Write PyTorch Code

Scaling machine learning pipelines using PyTorch can be a pain. 

You typically start a PyTorch-based machine learning project by defining the model architecture. Then you run it on a CPU machine and progressively build a training pipeline. Once the pipeline is done, you run the same code on a GPU or TPU machine for faster gradient computations. You update the PyTorch code to move all the tensors to GPU/TPU memory with a '.to(device)' function call. Now comes the difficult part: what if you want to use distributed training for the same pipeline? You have to overhaul the code and test it to make sure nothing is broken.
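As a minimal sketch of that manual device handling (the model and layer here are illustrative, not from any real pipeline), the plain-PyTorch pattern looks like this:

```python
import torch

# Pick whichever accelerator is available. This is exactly the kind of
# device bookkeeping that has to be repeated all over a raw PyTorch pipeline.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # move the model's parameters to the device
x = torch.randn(8, 4).to(device)          # every input tensor must be moved too

out = model(x)
print(out.shape)  # torch.Size([8, 2])
```

Every new tensor and module needs its own '.to(device)' call, and distributed training adds yet another layer of such boilerplate on top.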

Why sweat the small stuff? Let’s use PyTorch Lightning instead.

(more…)
Malayalam Subword Tokenizer

Let's start with the obvious question: what is a tokenizer? In Natural Language Processing (NLP), a tokenizer is a text preprocessing step that splits text into tokens. Tokens can be sentences, words, or any other unit that makes up a text.

Every NLP package has a word tokenizer implemented in it. But there is a certain challenge associated with Malayalam tokenization.
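To make "tokens" concrete, here is a deliberately naive word tokenizer (a hypothetical helper, not from any particular NLP package) that splits on whitespace; word-level splits like this are precisely what subword tokenizers refine further:

```python
def word_tokenize(text):
    # Naive whitespace tokenizer: each whitespace-separated chunk is a token.
    # Real word tokenizers also handle punctuation, and subword tokenizers
    # split words into smaller units.
    return text.split()

print(word_tokenize("Tokens can be sentences, words, or other units."))
# ['Tokens', 'can', 'be', 'sentences,', 'words,', 'or', 'other', 'units.']

# Whitespace splitting is script-agnostic, so it works on Malayalam text too:
print(word_tokenize("മലയാളം ഒരു ഭാഷയാണ്"))
# ['മലയാളം', 'ഒരു', 'ഭാഷയാണ്']
```

Note how punctuation stays glued to the neighboring word: even this simplest step needs language-aware care, which is where the real work begins.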

(more…)