Building an Intelligent Invoice Processing Solution – Part 2

In Part 1, we discussed how we extract text from invoices using PaddleOCR. Here we will outline the steps for retrieving relevant information from the extracted text using deep learning.

The invoices we had were dissimilar, so a template-based data extraction approach was out of the question. A Question-Answering (QA) Model was more appropriate in our case as the information we wanted to retrieve was present in the extracted text.

Question-Answering Model Selection

Question-Answering models are deep learning models that can retrieve answers to questions from a given context.

There are three QA variants:

  • Extractive QA: This model extracts answers from a context, which could be a text, table, or HTML.
  • Open Generative QA: This model generates answers in complete sentences based on the context. These answers may have additional words that were not present in the input text.
  • Closed Generative QA: No context is provided. Answers are generated based on the dataset the model is trained on.

Of the three, Extractive QA was the right fit for our case: we had a context, and the answers had to be extracted from it verbatim.

QA Model from HuggingFace

To implement the QA model, we chose the transformers library from HuggingFace, a data science platform that provides state-of-the-art (SOTA) pre-trained models for NLP and computer vision tasks, along with the tools to build, train, and deploy those models.

HuggingFace provides different variants of BERT, one of the most popular and widely used NLP models. BERT models consider the full context of a word by looking at the words that come before and after it, which is particularly useful when it comes to understanding the intent of the queries.


Initially, we used a pre-trained ALBERT model, an upgraded version of BERT. Its accuracy was above 94%, but extraction took more than 3 minutes on a CPU. On a GPU it took less than 8 seconds, but cost constraints meant we had to deploy the model on a CPU machine with 4 GB of RAM.

High-level architecture of ALBERT

Intel / Dynamic TinyBERT

We looked for an alternative and found one in Dynamic TinyBERT, Intel’s lightweight model optimized for CPU. TinyBERT is 7.5x smaller and 9.4x faster in inference compared to BERT.
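
As a rough illustration of how such a model can be queried, the sketch below uses the question-answering pipeline from the transformers library. The model ID `Intel/dynamic_tinybert` is assumed to be the checkpoint's name on the HuggingFace Hub, and the helper names `load_reader` and `ask` are hypothetical; this is a sketch under those assumptions, not the project's actual code.

```python
def load_reader(model_name="Intel/dynamic_tinybert"):
    """Build an extractive QA pipeline that runs on the CPU."""
    # Imported lazily so that ask() can be used without transformers installed.
    from transformers import pipeline

    # device=-1 selects the CPU; the model ID is an assumption.
    return pipeline("question-answering", model=model_name, device=-1)


def ask(reader, question, context):
    """Return (answer, score) for one question asked over the OCR text."""
    result = reader(question=question, context=context)
    return result["answer"], result["score"]


# Example (downloads the model on first use):
# reader = load_reader()
# ask(reader, "What is the invoice number?", "Invoice No: INV-1042 Date: 12/01/2023")
```

Because `ask` only relies on the pipeline's calling convention, the model can be swapped (for example, ALBERT for TinyBERT) without touching the rest of the code.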

High-level architecture of TinyBERT

When we ran a batch test on a set of invoices, the processing time dropped from 240 seconds to 8 seconds.

Though we got the output fairly quickly, accuracy was lacking compared to the ALBERT model. So we decided to fine-tune the model with our custom dataset.

Dataset Preparation

To fine-tune the QA model, we created our own custom dataset in SQuAD (Stanford Question Answering Dataset) format.

Sample SQuAD Format
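
For readers unfamiliar with the format, the following sketch builds one SQuAD-style record in Python. The invoice text, IDs, and the helper name `make_squad_entry` are made up for illustration; only the field names (`context`, `qas`, `answers`, `answer_start`, `is_impossible`) follow the SQuAD 2.0 layout.

```python
import json


def make_squad_entry(entry_id, context, question, answer_text):
    """Build one SQuAD-style paragraph with a single question and answer."""
    # In extractive QA the answer must be a verbatim span of the context.
    answer_start = context.find(answer_text)
    if answer_start == -1:
        raise ValueError("answer must appear verbatim in the context")
    return {
        "context": context,
        "qas": [
            {
                "id": entry_id,
                "question": question,
                "is_impossible": False,
                "answers": [{"text": answer_text, "answer_start": answer_start}],
            }
        ],
    }


# Made-up OCR text, for illustration only.
context = "ACME Fuels Invoice No: INV-1042 Date: 12/01/2023 Total: Rs. 1,249.50"
paragraph = make_squad_entry(
    "inv-1042-number", context, "What is the invoice number?", "INV-1042"
)
dataset = {
    "version": "v2.0",
    "data": [{"title": "invoice-1042", "paragraphs": [paragraph]}],
}
print(json.dumps(dataset, indent=2))
```

The `answer_start` offset is the character index of the answer inside the context, which is why the annotation step has to mark exact spans in the OCR text.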

Data Annotation

To create our custom dataset in SQuAD format, we had to annotate the data, that is, add labels and instructions to the raw text so that the model can learn to recognize the pieces of text that answer each question.

Among the many annotation tools, we chose to go with the open-source tool Label Studio due to its ease of use. 

We had three fields to extract from each invoice: invoice number, invoice date, and invoice amount. The questions we used were “What is the invoice number?”, “What is the invoice date?”, and “What is the invoice amount?”.

The image below shows how data is annotated in Label Studio.

Data annotation in Label Studio

We annotated a dataset of 200 fuel, mobile, and internet bills, which were in image and PDF formats. After annotating the bills, we exported the output as JSON (see sample below).

Model Fine-Tuning and Inferencing

Our next step was to fine-tune the pre-trained TinyBERT model with the annotated data. We split the annotated data into training and test sets in a 7:3 ratio using scikit-learn, then fine-tuned the model with the training script provided by the simpletransformers library.
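
The split and fine-tuning steps can be sketched roughly as follows. To keep the example dependency-free, the split uses Python's `random` module as a stand-in for scikit-learn's `train_test_split`. The function names, the `"bert"` model type, the model ID, and the hyperparameters are all assumptions chosen for illustration, not the values used in the project.

```python
import random


def split_dataset(records, test_ratio=0.3, seed=42):
    """Shuffle records and split them into (train, test) lists, 7:3 by default."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def fine_tune(train_data, test_data, model_name="Intel/dynamic_tinybert"):
    """Fine-tune a QA model on the CPU and return the evaluation result.

    train_data and test_data are lists of {"context": ..., "qas": [...]} dicts.
    """
    # Imported lazily so split_dataset() works without simpletransformers.
    from simpletransformers.question_answering import QuestionAnsweringModel

    # Hypothetical hyperparameters; tune them for your own data.
    args = {
        "num_train_epochs": 3,
        "learning_rate": 3e-5,
        "output_dir": "outputs/",
        "overwrite_output_dir": True,
    }
    # "bert" as the model type is an assumption about the checkpoint.
    model = QuestionAnsweringModel("bert", model_name, args=args, use_cuda=False)
    model.train_model(train_data)
    result, _ = model.eval_model(test_data)
    return result
```

With 200 annotated records, `split_dataset` yields 140 training and 60 test records.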

The fine-tuned model was loaded using the TransformerReader library. We then fed the questions below, along with the post-processed OCR output, to the model for prediction.

  • What is the processing fee? What is the total amount?
  • What is the bill number? What is the invoice number?
  • What is the date?

For each question, the prediction with the highest score was taken as the final answer.
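
One way to implement this selection is sketched below. It assumes that the two questions on each bullet are alternative phrasings for the same field, and that `reader` is any callable that takes `question` and `context` and returns a dict with `answer` and `score` keys (the convention of the transformers QA pipeline). The field names and the function `best_answers` are hypothetical.

```python
# Question phrasings grouped by target field (grouping is an assumption).
QUESTIONS = {
    "amount": ["What is the processing fee?", "What is the total amount?"],
    "invoice_number": ["What is the bill number?", "What is the invoice number?"],
    "date": ["What is the date?"],
}


def best_answers(reader, context, questions=QUESTIONS):
    """Ask every phrasing and keep the highest-scoring prediction per field."""
    final = {}
    for field, variants in questions.items():
        predictions = [reader(question=q, context=context) for q in variants]
        final[field] = max(predictions, key=lambda p: p["score"])
    return final
```

Asking several phrasings helps when invoices use different labels for the same field, such as "Bill No" on one invoice and "Invoice No" on another.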

Post-Processing of Extracted Answers

We cleansed the extracted answers and converted them to standard formats as per business requirements. This included:

  • Changing the date into YYYY-MM-DD format.
  • Removing currency symbols from the amount field.
  • Converting amounts written in words to numerals.
  • Rounding the amount up to the next integer value.
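
The clean-up steps above can be sketched as follows. The accepted date formats (day-first), the word list, and the function names are assumptions made for this example; real invoices will likely need more formats and more defensive handling.

```python
import math
import re
from datetime import datetime
from decimal import Decimal

# Candidate input formats, tried in order (day-first is an assumption).
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %b %Y", "%d %B %Y", "%Y-%m-%d")
SMALL = {w: i for i, w in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split())}
TENS = {w: 10 * i for i, w in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2)}
SCALES = {"thousand": 1_000, "lakh": 100_000, "million": 1_000_000}


def normalize_date(text):
    """Convert a date string to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def words_to_number(text):
    """Convert an amount in English words to an integer."""
    total = current = 0
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in SMALL:
            current += SMALL[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        # Any other word ("and", "only", "rupees", ...) is ignored.
    return total + current


def normalize_amount(text):
    """Strip currency symbols and separators, then round up to an integer."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match:
        value = Decimal(match.group().replace(",", ""))
    else:
        value = Decimal(words_to_number(text))
    return math.ceil(value)
```

For example, `normalize_amount("Rs. 1,249.50")` and `normalize_amount("One Thousand Two Hundred and Fifty Only")` both return 1250, and `normalize_date("12/01/2023")` returns "2023-01-12".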

Output Analysis

We tested our framework using 50 files from each category: mobile screenshots of PDFs (low-resolution images), fuel bills in both image and PDF formats, and internet/mobile digital invoices.

The results are shown below.

Intelligent Invoice Processing Results

The accuracy of the final output varied due to factors like:

  • OCR accuracy (Quality and resolution of the input image)
  • Size of the text content (Multi-page invoices)
  • Noise in the text content (Presence of logos and images)

Sample Outputs from Intelligent Invoice Processing framework

Sample input invoice

JSON Output

Output screen for single-page invoice
Sample multi-page invoice
Output screen for multi-page invoice

Business Benefit

The invoice processing system we implemented reduced the manual effort for the client’s accounts team by more than 70%. It is a generalized solution that can be reused to extract or verify information from other types of documents simply by changing the questions asked.

For example, using the same framework, PAN cards can be parsed as shown below.

Sample PAN card input image

JSON Input

JSON Output

The highlight of this framework is that it is built using open-source technologies without any third-party commercial APIs.