How did we increase Data Extraction accuracy by a whopping ~50%?

What if I told you that you had to enter every receipt detail manually to get your business expenses reimbursed? Sounds painful, right?

Hey there, I am Madhav Mansuriya. I joined Fyle as an intern back in 2019 (I was a part of Fyle even before I graduated 😛, lucky me!) and now I own an entire module called Data Extraction (DE). From fixing small bugs to rewriting the whole mobile app and the DE service, I have evolved a lot! Today I want to share a small story of how we improved our Data Extraction accuracy by ~50%.

This journey has taught me a lot, from how to work as an intern to how to work with an intern!

What is Data Extraction?

We at Fyle strive hard to reduce or remove any manual effort our users need to submit their business receipts, and we think the result is a magical experience!

The capabilities of our module help make submitting expenses buttery smooth: just click a photo of the receipt, sit back, and relax 🏖

Once the receipt is uploaded, all the required data is automatically extracted from it and your expense is submitted. Hard to believe, right? I know! That is why I said it is magical.

Our system is so robust that it supports various types of receipts, currencies, vendors, and whatnot! Don’t believe me? See for yourself:

The Magic of Data Extraction

How does the DE module work?

Fyle supports paper receipts (a photo of the receipt) and digital receipts (PDF invoices and email forwards). When a user uploads a receipt or invoice from any client (web app, mobile app, email forwards, etc.), our DE module comes into play.

Earlier, we used Google Vision to extract the data from the receipt. Our models, built on top of Google Vision, tried to find the required details: Amount, Currency, Date of Spend, Category of Spend, and Merchant.

  • To extract details like the amount and category, we had machine learning models that helped classify and extract the required data.
  • Receipts in different currencies carry the date and currency in all kinds of formats and positions. Our smart logic layer made it very easy to extract and classify the relevant info.
  • Guess what? The trickiest part of a receipt is the merchant name. There are thousands of merchants across the globe, hundreds of fonts, and tons of receipt formats, and the combination of all these makes it extremely difficult to extract the merchant name with very high accuracy.
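To make the "all kinds of formats" point concrete, here is a minimal sketch of date normalization by trying a list of candidate formats in order. It is not Fyle's actual logic layer; the format list and the `normalize_date` helper are assumptions made up for illustration.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical format list; a real receipt corpus would need many more entries.
# Day-first formats are tried before any month-first numeric format, so an
# ambiguous string like "01/02/2023" is read as 1 February 2023.
CANDIDATE_FORMATS = [
    "%Y-%m-%d",   # 2023-01-15
    "%d/%m/%Y",   # 15/01/2023
    "%d-%m-%Y",   # 15-01-2023
    "%d %b %Y",   # 15 Jan 2023
    "%b %d, %Y",  # Jan 15, 2023
]


def normalize_date(raw: str) -> Optional[date]:
    """Return the first successful parse of `raw`, or None if no format matches."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this format did not fit, try the next one
    return None


for sample in ["2023-01-15", "15/01/2023", "Jan 15, 2023", "sometime last week"]:
    print(f"{sample!r} -> {normalize_date(sample)}")
```

Returning `None` instead of guessing keeps an unreadable date from turning into a wrong one, which matters more than coverage when the value ends up on an expense report.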

What was the accuracy before?

  • What the heck is accuracy? Accuracy here means the share of data points that get extracted correctly from a given set of receipts.
  • How is accuracy calculated? We compare what people like you and me would read on a receipt with what the machine learning model extracts.
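As a rough illustration of that comparison, the sketch below scores one field at a time: a receipt counts as a hit only when the model's value equals the human-entered value. The data and the `field_accuracy` helper are invented for this example, and the exact matching rules we use in production (normalization, tolerances) are not shown here.

```python
from typing import Dict, List, Optional

Receipt = Dict[str, Optional[str]]


def field_accuracy(labels: List[Receipt], predictions: List[Receipt], field: str) -> float:
    """Percentage of receipts whose predicted `field` equals the human label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must describe the same receipts")
    if not labels:
        return 0.0
    matches = sum(
        1
        for truth, predicted in zip(labels, predictions)
        # A missing prediction never counts as a match.
        if predicted.get(field) is not None and predicted.get(field) == truth.get(field)
    )
    return 100.0 * matches / len(labels)


# Made-up example: what a person reads vs. what the model extracted.
human_labels = [
    {"amount": "23.50", "currency": "USD"},
    {"amount": "12.00", "currency": "INR"},
    {"amount": "99.99", "currency": "EUR"},
    {"amount": "5.25", "currency": "USD"},
]
model_output = [
    {"amount": "23.50", "currency": "USD"},
    {"amount": "12.00", "currency": None},
    {"amount": "9.99", "currency": "EUR"},
    {"amount": "5.25", "currency": "USD"},
]

for name in ("amount", "currency"):
    print(f"{name}: {field_accuracy(human_labels, model_output, name):.2f}%")
```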
Old Model's Accuracy

What magic did we do to get to our goal?

We had been using the old model for more than 5 years, and a boost in accuracy was long overdue. We named our experiment DE-v2. We had heard about Amazon Textract, which can synchronously analyze an input receipt or invoice for financially related relationships between text!

Shall we go on a little technical joy ride?

It was time to experiment with Textract, so we did some research and built a bunch of POCs.

Like with all experiments, some failed quickly and some others failed rather horribly, but I strongly believe in the words: “Success is sweet, but the secret is sweat.”

We ended up getting some good results with the following two APIs:

  • Analyze Expense: It synchronously analyzes an input receipt for financially related relationships between text, e.g. Amount, Date, Merchant name, etc.
Sample JSON from AWS Textract
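For readers who have not seen an Analyze Expense response, here is a small sketch of pulling the summary fields out of one. The `SAMPLE_RESPONSE` below is hand-written and heavily trimmed (it is not real output), and `summary_fields` is a hypothetical helper, not our production parser. In a live setup the response would come from a call such as `boto3.client("textract").analyze_expense(Document={"Bytes": image_bytes})`.

```python
from typing import Any, Dict, Tuple

# Hand-written, trimmed stand-in for an Analyze Expense response.
# Only the keys this sketch reads are included.
SAMPLE_RESPONSE: Dict[str, Any] = {
    "ExpenseDocuments": [
        {
            "SummaryFields": [
                {"Type": {"Text": "VENDOR_NAME"},
                 "ValueDetection": {"Text": "Example Cafe", "Confidence": 97.4}},
                {"Type": {"Text": "INVOICE_RECEIPT_DATE"},
                 "ValueDetection": {"Text": "15 Jan 2023", "Confidence": 99.1}},
                {"Type": {"Text": "TOTAL"},
                 "ValueDetection": {"Text": "21.00", "Confidence": 62.0}},
                {"Type": {"Text": "TOTAL"},
                 "ValueDetection": {"Text": "23.50", "Confidence": 96.2}},
            ]
        }
    ]
}


def summary_fields(response: Dict[str, Any]) -> Dict[str, Tuple[str, float]]:
    """Map each field type to its highest-confidence (value, confidence) pair."""
    best: Dict[str, Tuple[str, float]] = {}
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            field_type = field.get("Type", {}).get("Text")
            detection = field.get("ValueDetection", {})
            text = detection.get("Text")
            confidence = float(detection.get("Confidence", 0.0))
            if not field_type or not text:
                continue  # skip entries without a usable type or value
            if field_type not in best or confidence > best[field_type][1]:
                best[field_type] = (text, confidence)
    return best


for field_type, (value, confidence) in summary_fields(SAMPLE_RESPONSE).items():
    print(f"{field_type}: {value} ({confidence:.1f})")
```

Keeping the highest-confidence candidate per type is just one possible rule; a receipt with both a subtotal and a total printed in similar positions may need something smarter.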

We got some initial results when experimenting, but they were not up to the mark! The idea was to develop smart layers on top of Textract to get the most accurate data. Once we get the data from Textract, we rely on an internally developed intelligent layer to achieve maximum accuracy without increasing the error rate.
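One simple way to think about "more accuracy without more errors" is a confidence gate: accept a value only when the extractor is confident enough, and otherwise leave the field empty rather than fill in a guess. The sketch below shows that idea only. It is an assumption about what such a layer could look like, not a description of our intelligent layer, and the `MIN_CONFIDENCE` threshold is a made-up number.

```python
from typing import Dict, Optional, Tuple

MIN_CONFIDENCE = 90.0  # hypothetical cut-off, would need tuning on labeled receipts


def accept_confident(
    fields: Dict[str, Tuple[str, float]],
    min_confidence: float = MIN_CONFIDENCE,
) -> Dict[str, Optional[str]]:
    """Keep values at or above the threshold; blank out the rest with None."""
    return {
        name: (value if confidence >= min_confidence else None)
        for name, (value, confidence) in fields.items()
    }


# Made-up extraction result: field name -> (value, confidence in percent).
extracted = {
    "amount": ("23.50", 96.2),
    "date": ("15 Jan 2023", 99.1),
    "merchant": ("Exampl Caf", 58.3),
}
print(accept_confident(extracted))
```

A blank field costs the user a few seconds of typing, while a wrong value that slips through costs trust, so the threshold trades coverage for a lower error rate.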

This is all good, show me the NUMBERS!

What is the accuracy now?

New Model's Accuracy

The Results

  • The accuracy of the date field was insane: it went up from 43.46% to 74.85% (paper + digital receipts), a whopping ~72.2% relative increase.
  • The amount field's accuracy saw a solid ~51.7% increase (from 45.37% to 68.84%).
  • The category of the receipt is the most diverse area of extraction. To classify a receipt into categories, a lot of parameters need to be considered. We had an increase of ~169% (from 16.53% to 44.47%; in all honesty, the old accuracy was low).
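The increases quoted here are relative increases, i.e. (new − old) / old, not the difference in percentage points. A quick sketch to recompute them from the old and new accuracies listed above (the `relative_increase` helper is just for illustration):

```python
def relative_increase(old: float, new: float) -> float:
    """Relative increase from `old` to `new`, in percent."""
    if old <= 0:
        raise ValueError("old accuracy must be positive")
    return (new - old) / old * 100.0


# (old accuracy %, new accuracy %) taken from the results above.
ACCURACIES = {
    "date": (43.46, 74.85),
    "amount": (45.37, 68.84),
    "category": (16.53, 44.47),
}

for field, (old, new) in ACCURACIES.items():
    points = new - old
    print(
        f"{field}: {old}% -> {new}% "
        f"(+{points:.2f} points, {relative_increase(old, new):.1f}% relative increase)"
    )
```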
Old vs New Accuracy for Paper Receipts

Old vs New Accuracy for Digital Receipts

My Learnings

  • Textract, a new toy, got added to my list of experiments. I will be adding some sci-fi features on top of it 🤞
  • I got a chance to build a new microservice from scratch in Python.
  • I joined the DE team a year ago as a noob, and today I run the show. The things I take care of now include maintaining the service, building the test infrastructure, deeply analyzing all the old services and old test infra, and merging 5+ years of work into a single new service in just a month’s time. Sounds awesome, right? Of course, it is AWESOME!

One Last Thing

The reason I could build the entire infrastructure in a month is our test infrastructure. Here is how it looks when run:

https://cdn-images-1.medium.com/max/1600/1*OYwJL0uRYK01o1IgFmWVVw.gif

Our test infrastructure helped us get the clean and slick stats presented above, and it let us push things to production confidently at insane speeds.

The module is strongly backed by our testing infra: from a pocket-size change to a full-fledged feature, everything has to pass its strict assessment. If it finds anything odd, bang! The scripts pinpoint the failing test cases and flag that the changes are not yet production-ready. All of this is automated. Sounds fascinating, right? Yes, it actually is! (An article about that is coming soon.)
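To give a flavor of what "pinpoint the failing test cases" can mean, here is a toy sketch of a regression check over labeled receipts. Everything in it is hypothetical: `find_failures`, the `toy_extract` stand-in, and the case format are invented for this post and are far simpler than our real test infra.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str, Dict[str, str]]  # (case id, receipt text, expected fields)
Failure = Tuple[str, str, str, str]     # (case id, field, expected, actual)


def find_failures(cases: List[Case], extract: Callable[[str], Dict[str, str]]) -> List[Failure]:
    """Run `extract` on every case and list each field that differs from the label."""
    failures: List[Failure] = []
    for case_id, receipt_text, expected in cases:
        actual = extract(receipt_text)
        for field, want in expected.items():
            got = actual.get(field, "")
            if got != want:
                failures.append((case_id, field, want, got))
    return failures


def toy_extract(receipt_text: str) -> Dict[str, str]:
    """Toy stand-in for the real extractor: reads 'key=value' pairs split by ';'."""
    return dict(pair.split("=", 1) for pair in receipt_text.split(";") if "=" in pair)


CASES: List[Case] = [
    ("receipt-1", "amount=23.50;currency=USD", {"amount": "23.50", "currency": "USD"}),
    ("receipt-2", "amount=12.00", {"amount": "12.00", "currency": "INR"}),
]

failures = find_failures(CASES, toy_extract)
production_ready = not failures

for case_id, field, want, got in failures:
    print(f"FAIL {case_id}: {field} expected {want!r}, got {got!r}")
print("production-ready" if production_ready else "not yet production-ready")
```

In a CI pipeline, the same check would typically end with a non-zero exit code when `failures` is non-empty, so the change cannot be merged by accident.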

Madhav Mansuriya
