How to Classify Documents With OCR and Machine Learning
In bygone days, companies had large mailrooms to process incoming documents. The sheer variety of documents that came in made it hard to work efficiently. One could be an invoice from a supplier, while another could be a customer service letter or a fine. Every one of them had to be forwarded to the right department, processed manually, and filed in a big archive full of filing cabinets.
Over the past decades, many of these organizations digitized their systems, moving to digital mailrooms, document management systems, and archives. They receive most documents through emails now. For the last documents that come in on paper, they use large scanners to digitize them.
In our opinion, this is just the first step in a company’s digital transformation. If you really want to improve operational efficiency, you should take it to the next level. Classifying and sorting the content of documents, and making sure they are available in searchable text, are valuable next steps that can be automated in your document processing workflow.
In this blog post, we will reveal how you can do this with the help of Optical Character Recognition (OCR) and machine learning.
The Secret Ingredient: Algorithms
Document digitization companies make use of machine learning algorithms, which are trained on a large set of documents. The algorithms are able to extract many document characteristics, such as file types, formats, and sizes.
But that’s not all.
The software also extracts the content of documents with the help of OCR and performs text and statistical analyses using natural language processing (NLP) to determine topic clusters. By identifying patterns within sets of document types, it is possible to match unknown documents to a certain set.
This works as follows:
- An unknown document is presented to the software.
- The characteristics and content are extracted and fed to the algorithms.
- This results in a similarity score for each document category in the data set that the model was trained with.
- The similarity scores are then compared with one another.
- The category with the highest similarity score is the most likely candidate for classification.
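The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular product works: it assumes the text has already been extracted by OCR, represents each document as a bag of word counts, and scores an unknown document against one combined word-count profile per category using cosine similarity. The category names, sample texts, and function names are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_profiles(labeled_docs):
    """Sum the word counts of all training documents per category."""
    profiles = {}
    for text, category in labeled_docs:
        profiles.setdefault(category, Counter()).update(tokenize(text))
    return profiles


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, profiles):
    """Return (best_category, scores) for an unknown document."""
    vector = Counter(tokenize(text))
    scores = {cat: cosine_similarity(vector, prof) for cat, prof in profiles.items()}
    return max(scores, key=scores.get), scores


# Hypothetical training data: OCR text paired with a category label.
training = [
    ("Invoice number 1001 total amount due payment terms", "invoice"),
    ("Invoice for services rendered amount due within 30 days", "invoice"),
    ("Dear customer service team I would like to complain about my order", "letter"),
    ("Dear sir thank you for your letter regarding my complaint", "letter"),
]

profiles = build_profiles(training)
best, scores = classify("Please pay the total amount due on this invoice", profiles)
print(best, scores)
```

A production system would typically use richer features (layout, metadata, weighted terms) and a trained model, but the principle of scoring an unknown document against known categories stays the same.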
With such an automated workflow, it is possible to achieve an accuracy rate of more than 99%, while one sorting task takes around 0.1 seconds. Manual sorting is of course much slower, as it takes a human at least a few seconds to sort a document. Humans also generally don’t reach an accuracy rate higher than 95%, depending on the complexity of the task.
If you translate this into a real-life situation in which you need to sort 100,000 documents, the manual route will take about 20 times longer and result in roughly five times as many errors (about 5% of documents misclassified instead of about 1%). You don’t have to be a mathematician to know that this will easily cost thousands of euros extra per month, while an algorithm only costs a fraction of that price.
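As a rough back-of-the-envelope check, the comparison can be worked out with the speed and accuracy figures quoted above. The 2 seconds per document for manual sorting is an assumption (the low end of "a few seconds"), and the accuracy rates are treated as exact for simplicity.

```python
DOCUMENTS = 100_000

# Figures quoted in the text; the manual time is an assumed lower bound.
AUTO_SECONDS_PER_DOC = 0.1
MANUAL_SECONDS_PER_DOC = 2.0  # assumption: "a few seconds" taken as 2 s
AUTO_ACCURACY = 0.99
MANUAL_ACCURACY = 0.95

# Total processing time in hours for each route.
auto_hours = DOCUMENTS * AUTO_SECONDS_PER_DOC / 3600
manual_hours = DOCUMENTS * MANUAL_SECONDS_PER_DOC / 3600

# Expected number of misclassified documents for each route.
auto_errors = round(DOCUMENTS * (1 - AUTO_ACCURACY))
manual_errors = round(DOCUMENTS * (1 - MANUAL_ACCURACY))

print(f"Automated: {auto_hours:.1f} hours, about {auto_errors} misclassified documents")
print(f"Manual:    {manual_hours:.1f} hours, about {manual_errors} misclassified documents")
```

With these inputs, the automated route takes under 3 hours against roughly 56 hours by hand, and leaves about 1,000 misclassified documents against about 5,000.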
The Use of Document Classification Software
Think about one of your company’s documents for a while. How many characteristics can you think of? File type, document type, language, country of origin? Basically, any feature that you can identify can be used to classify documents with document classification software, plus a little more.
The only requirement is a sufficient amount of data to train a machine-learning algorithm to understand the differences between certain features. In that regard, these algorithms are not very different from us humans. They learn about the differences between documents through experience.
To give you an idea of the possibilities, you will find a non-exhaustive list of the things you can do with document classification software below:
- Classification of file types
- Classification of document types
- Classification of document languages
- Classification of countries of origin
- Classification of merchants
- Classification of line items
- Classification of urgency
- Classification of privacy-sensitive data
Classification of File Types
The first step in most situations is to identify every stored file in your archive or database. With document classification software you can quickly sort, classify and label PDF and Word documents, Excel sheets, emails, images, and so on.
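For file types, a simple rule-based check often goes a long way before any machine learning is involved. The sketch below looks at the first bytes of a file (its "magic number") rather than trusting the file extension. The function names and labels are made up for the example, and only a handful of formats are covered.

```python
# Leading bytes ("magic numbers") of a few common file formats.
# Modern Office files (docx, xlsx) are ZIP containers, so they share the ZIP signature.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),
]


def detect_file_type(data: bytes) -> str:
    """Return a label for the file based on its first bytes."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label
    return "unknown"


def detect_file_type_from_path(path: str) -> str:
    """Read only the first bytes of a file on disk and classify it."""
    with open(path, "rb") as handle:
        return detect_file_type(handle.read(16))


print(detect_file_type(b"%PDF-1.7 example"))
print(detect_file_type(b"\x89PNG\r\n\x1a\nexample"))
print(detect_file_type(b"plain text"))
```

Telling a Word document from an Excel sheet would need a further look inside the ZIP container, which is left out here.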
Classification of Document Types
When you know the file types in your archive, you can go one step further and classify the document type. You may want to know whether a document is an invoice or a receipt, a contract or a customer service letter, a bank statement or an identity document, and so on. Document classification software automatically scans the data on the document and gives you the answer within seconds.
Classification of Document Languages
If you have documents in multiple languages, such as a contract or a user manual, it can be useful to classify the document language. Let’s say you are looking for the German version. How will you find it in an archive with thousands of documents?
With document classification software, you can label all documents with “English”, “German”, “Italian”, and so on, to make your life a little bit easier.
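To give a feel for how language labeling can work, here is a deliberately simple heuristic: count how many common function words ("stopwords") of each language appear in the text. This is a stand-in technique for illustration, not a trained language model, and the tiny word lists and function name are assumptions made for the example.

```python
import re

# Tiny hand-picked stopword lists; real systems rely on far larger models.
STOPWORDS = {
    "English": {"the", "and", "of", "to", "is", "that", "for", "with"},
    "German": {"der", "die", "das", "und", "ist", "nicht", "mit", "für", "ein"},
    "Italian": {"il", "la", "che", "di", "e", "per", "non", "con", "una"},
}


def detect_language(text):
    """Label the text with the language whose stopwords occur most often."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
    scores = {
        language: sum(word in stopwords for word in words)
        for language, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


print(detect_language("Der Vertrag ist mit der Unterschrift gültig und nicht übertragbar"))
print(detect_language("The contract is valid for the duration of the agreement"))
```

Short or mixed-language texts will trip up a heuristic like this, which is one reason trained classifiers are used in practice.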
Classification of Countries of Origin
The country of origin of a document can be classified as well. Documents such as shipping labels or passports contain information about the country of origin and can be labeled for sorting purposes.
Classification of Merchants
Merchant names can give you information about the type of store in which a purchase was made. This is especially useful when classifying or sorting receipts and invoices. Think of category labels such as supermarket or pharmacy, and update your database accordingly.
Classification of Line Items
Do you need more than just a “simple” document type classification? Do you want to know exactly which products are on a receipt or invoice? Some document classification solutions can read the line items on such documents and classify them into categories like “Food & Drinks”, “Transportation” or “Electronics”. Just imagine how useful this would be to determine tax return eligibility or analyze customer behavior.
Classification of Urgency
Remember that document classification software uses NLP to determine topic clusters? This same technique can be used to determine priorities in large volumes of customer support tickets. Complaints or messages from angry customers can be classified as “high priority”, while a less alarming support ticket with regard to a product feature can be classified as “low priority”.
Classification of Privacy-Sensitive Data
Companies nowadays have to comply with the GDPR or other strict privacy regulations. Losing privacy-sensitive data or accidentally opening it up to the public will not only lead to bad publicity and/or heavy fines, but it can also mean the end of your business.
That’s why it’s crucial to identify and classify documents containing privacy-sensitive information, such as passports, ID cards, or credit cards. Document classification software can automatically detect and label these documents for you. Even better, it can anonymize them by removing or blacking out specific lines on a document.
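As a small illustration of the anonymization idea, the sketch below redacts credit card numbers in text that has already been extracted by OCR. It assumes card numbers appear as 13 to 19 digits, optionally separated by spaces or hyphens, and uses the Luhn checksum to avoid redacting arbitrary long numbers. The function names and the "[REDACTED]" marker are made up for the example, and blacking out the matching area on the scanned image itself is not covered.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Check the Luhn checksum used by payment card numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def redact_card_numbers(text: str) -> str:
    """Replace digit sequences that pass the Luhn check with a marker."""
    def replace(match):
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED]" if luhn_valid(digits) else match.group()

    return CARD_PATTERN.sub(replace, text)


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
print(redact_card_numbers("Card: 4111 1111 1111 1111, order 1234567890123"))
```

The order number in the example is left untouched because it fails the checksum, which shows why a validation step is worth adding on top of a plain pattern match.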
Getting Started With Document Classification Software
If this article has sparked your interest in automated document classification, you might wonder where to start. Well, there are two steps you’ll need to follow:
- Gathering your data set
- Training your algorithm
Gathering Your Data Set
If you want to train your own machine learning algorithm, you’ll need to gather enough data. The data set needs to consist of enough documents for each category so that the algorithm can learn about the differences between them.
Moreover, the quality of the data set is crucial. If the examples that you train your algorithm with are incorrectly annotated, the model will learn from these mistakes and make the same mistakes when calculating its estimations.
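Both points can be checked before any training starts. The sketch below counts the examples per category and flags identical texts that carry conflicting labels, which is one common sign of annotation mistakes. The threshold of three examples is an arbitrary value chosen for the demo (real data sets need far more), and the function name and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def check_dataset(samples, min_per_category=3):
    """Return per-category counts, under-represented categories, and label conflicts.

    `samples` is a list of (text, label) pairs.
    """
    counts = Counter(label for _, label in samples)
    too_small = sorted(cat for cat, n in counts.items() if n < min_per_category)

    # Group labels by normalized text to spot the same document labeled differently.
    labels_by_text = defaultdict(set)
    for text, label in samples:
        labels_by_text[text.strip().lower()].add(label)
    conflicts = sorted(t for t, labels in labels_by_text.items() if len(labels) > 1)

    return counts, too_small, conflicts


# Hypothetical annotated samples; the last one contradicts the first.
samples = [
    ("Invoice 1001 total due", "invoice"),
    ("Invoice 1002 total due", "invoice"),
    ("Invoice 1003 amount due", "invoice"),
    ("Receipt for coffee", "receipt"),
    ("Invoice 1001 total due", "receipt"),
]

counts, too_small, conflicts = check_dataset(samples)
print(counts, too_small, conflicts)
```

Here the "receipt" category is flagged as too small, and the duplicated invoice text is flagged because it appears under two different labels.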
Training Your Algorithm
It might become a bit technical from here, but once you have gathered your data set, you can start training your classification algorithm. There are many algorithms you can use, such as Naive Bayes and Support Vector Machines.
You can use open source tools like scikit-learn or TensorFlow to train these algorithms, but you’ll need to know how to code and have some basic knowledge of machine learning.
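To make this less abstract, here is a from-scratch sketch of a multinomial Naive Bayes text classifier with add-one smoothing, written in plain Python so that it runs without any extra packages. It is for illustration only: in practice you would more likely rely on the ready-made implementation in a library such as scikit-learn. The class name, training texts, and labels are made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        """Count documents per label and words per label."""
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(self._tokenize(text))
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        """Return the label with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for word in self._tokenize(text):
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    @staticmethod
    def _tokenize(text):
        return re.findall(r"[a-z]+", text.lower())


# Hypothetical OCR output with annotated document types.
texts = [
    "invoice number total amount due",
    "invoice payment terms amount due",
    "receipt thank you for your purchase paid by card",
    "receipt store purchase paid in cash",
]
labels = ["invoice", "invoice", "receipt", "receipt"]

model = NaiveBayesClassifier().fit(texts, labels)
print(model.predict("amount due on invoice"))
print(model.predict("paid by card at the store"))
```

Four training documents are obviously far too few for real use. The point is only to show that training boils down to counting words per category, and prediction to comparing probabilities.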
Luckily, there are proven solutions available for direct implementation!
Automated Document Classification With Klippa
Save yourself the hassle of months or even years of research, development, and testing, and start using OCR and machine learning for automated document classification today. There are many classification tools available on the market that make it super easy to start. Some of these tools don’t even need you to write a single line of code.
Klippa’s OCR API, for example, provides a plug-and-play OCR solution that you can get started with right away. You can integrate it with existing applications you use on a daily basis to efficiently classify your documents within seconds.
The possibilities of automated document classification are nearly endless. So if you have an archive that is in dire need of organizing, it might be time to jump on the bandwagon and solve all your archiving challenges!