So, you’re here because you have data on paper documents that you’d like to convert into digital text. Great!
When using computers to collect data and transform it into information, it’s helpful to understand that you must give them rules and criteria. The rules and criteria that computers use for collecting data off paper documents usually consist of pattern matching. This is where OCR comes in.
First, What Is OCR?
Most people who have heard of OCR think of it as a feature included with another piece of software to perform word searches. But OCR is much more than that.
Optical Character Recognition, or OCR, has essentially been around since 1913. First used to interpret Morse code and assist the blind, OCR technology has continued to evolve. Grooper’s OCR technology is at the pinnacle of this evolution by creating an intelligent digital awareness of a document. But more on that later.
Using OCR technology, a computer compares patterns made by letters and numbers on scanned documents to a set of characters stored in the software.
If OCR and the process of digitizing text are like a video game, then patterns make up the rules and criteria of the game.
We want our computer’s software to reliably recognize patterns, or pattern matching won’t work, and we won’t fare very well at this game.
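To make the idea of pattern matching concrete, here is a toy sketch in Python. It compares a tiny 3x3 "scanned" glyph against a stored set of characters and picks the closest one. The templates, the `match_glyph` name, and the pixel-agreement score are all assumptions made up for illustration; real OCR engines, Grooper included, are far more sophisticated than this.

```python
# Toy 3x3 "glyphs": 1 is ink, 0 is background. These templates are invented
# for illustration and are not taken from any real OCR engine.
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def match_glyph(glyph):
    """Return (best_char, score); score is the fraction of pixels that agree."""
    best_char, best_score = None, -1.0
    for char, template in TEMPLATES.items():
        total = sum(len(row) for row in template)
        same = sum(
            1
            for g_row, t_row in zip(glyph, template)
            for g, t in zip(g_row, t_row)
            if g == t
        )
        score = same / total
        if score > best_score:
            best_char, best_score = char, score
    return best_char, best_score


# A slightly noisy "L": one stray ink pixel in the top-right corner.
noisy_l = ((1, 0, 1),
           (1, 0, 0),
           (1, 1, 1))
best, score = match_glyph(noisy_l)
print(best, round(score, 2))
```

Even with one stray pixel, the noisy glyph still lands closest to "L". Pile on more noise, though, and the scores for different characters start to crowd together, which is exactly why scan quality matters so much.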
Your Mission: Beat Poor Scan Quality and Low Contrast Images
As OCR works to recognize patterns, many things can confuse the technology and cause problems. To give you an accurate digital conversion, OCR needs high-resolution, black-and-white scans.
Having a black-and-white scan creates high image contrast, making the job simple for OCR. When it comes to resolution, a scan of a document that is low resolution creates a lot of noise around the letters and numbers, confusing OCR.
Low black-and-white contrast and poor resolution are the villains that we must beat.
Grooper, thankfully, has excellent tools to help us easily defeat these villains. Using these tools will give you great black-and-white images, giving you the best starting point for OCR. Later, fuzzy data extraction will overcome poor resolution. But there are many more very tough villains ahead of us to defeat.
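If you’re wondering what "creating high image contrast" looks like in practice, here is a minimal sketch of the simplest possible approach: a fixed threshold that pushes every gray pixel to pure black or pure white. The `binarize` function and the cutoff of 128 are assumptions for illustration, not Grooper’s cleanup tools; real image cleanup typically chooses the threshold adaptively.

```python
def binarize(gray, threshold=128):
    """Map grayscale pixels (0 = black, 255 = white) to pure black or white."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


# A faint gray stroke (value 90) on an off-white, slightly noisy background.
scan = [
    [230, 225, 90, 228],
    [231, 90, 90, 226],
    [229, 224, 90, 140],
]
clean = binarize(scan)
for row in clean:
    print(row)
```

After thresholding, the faint stroke is solid black and the background noise, including the gray smudge in the corner, is gone.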
The Tools, or “Cheat Codes”, to Overcome OCR Problems
Ready for the cheat codes? Here’s what they’re called: Bounded Region Detection, Segment Reprocessing, Iterative OCR, and Cellular Validation.
Just like any good cheat code, these let you break the rules of the game. Typical full-text OCR gets only one pass at an entire page of relatively complex symbols to get it right. These cheat codes, or tools, allow Grooper to overcome the problems that come with that single pass.
Each of these tools gives Grooper the capability to understand where the problem areas are, home in on them, and get as good an OCR read as possible.
Let’s take a look at each one.
Level 1 Villain: Bounded Regions
Cheat Code: Bounded Region Detection
A bounded region consists of whitespace surrounded by lines. Or simply put, words in a box.
Imagine an invoice, purchase order, or delivery receipt. These kinds of documents are composed of text in tables and boxes. Using Bounded Region Detection, Grooper finds those boxes, looks only at what’s inside them, and gets a great read on the contents.
Grooper just took what has typically been a nightmare for legacy image cleanup software to get rid of and used it to its advantage. It’s like a judo master who takes on someone twice their size running at them and uses that momentum to flip the attacker to the ground…with just a pinky.
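Here is a rough sketch of the "words in a box" idea, using a page drawn in plain characters where `#` marks a line pixel. It flood-fills every area that is not a line and keeps only the areas that never reach the page edge, since those must be enclosed by lines. The `find_bounded_regions` name and the character-grid page are assumptions for illustration; this is not how Grooper implements Bounded Region Detection.

```python
from collections import deque


def find_bounded_regions(page):
    """Return (top, left, bottom, right) boxes for areas enclosed by '#' lines.

    `page` is a list of equal-length strings; '#' marks a line pixel.
    """
    rows, cols = len(page), len(page[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if page[r][c] == "#" or seen[r][c]:
                continue
            # Flood-fill one connected area of non-line pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            touches_edge = False
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                if y in (0, rows - 1) or x in (0, cols - 1):
                    touches_edge = True
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and page[ny][nx] != "#":
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # An area that reaches the page edge is not enclosed by lines.
            if not touches_edge:
                boxes.append((top, left, bottom, right))
    return boxes


page = [
    "...............",
    ".#####.#######.",
    ".#INV#.#TOTAL#.",
    ".#####.#######.",
    "...............",
]
boxes = find_bounded_regions(page)
contents = [page[top][left:right + 1] for top, left, bottom, right in boxes]
print(boxes, contents)
```

Once the boxes are known, each one can be read on its own, so the lines that used to get in the way now tell us exactly where to look.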
Level 2 Villain: Segments
Cheat Code: Segment Reprocessing
A segment is a small block or line of text on a page. If any segment gets a low OCR confidence score, Grooper uses Segment Reprocessing to run OCR on that segment a second time. The outcome is much better results for each of these troublesome lines.
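The retry logic can be sketched in a few lines. In this sketch, `first_pass` and `second_pass` are hypothetical stand-ins for real OCR calls that return a text and a confidence score, and the 0.8 cutoff is an arbitrary assumption. How Grooper’s second read actually differs from the first is not covered here, so treat this as the general shape of the idea only.

```python
def reprocess_segments(segments, first_pass, second_pass, min_confidence=0.8):
    """OCR every segment once, then retry only the low-confidence ones.

    `first_pass` and `second_pass` are callables that take a segment and
    return (text, confidence). Both are hypothetical stand-ins for OCR calls.
    """
    results = []
    for segment in segments:
        text, confidence = first_pass(segment)
        if confidence < min_confidence:
            retry_text, retry_confidence = second_pass(segment)
            # Keep the retry only if it actually scored higher.
            if retry_confidence > confidence:
                text, confidence = retry_text, retry_confidence
        results.append((text, confidence))
    return results


# Fake engines for demonstration: canned results keyed by segment name.
FULL_PAGE = {"header": ("INVOICE", 0.97), "total": ("T0ta1: $4Z", 0.41)}
ZOOMED_IN = {"header": ("INVOICE", 0.97), "total": ("Total: $42", 0.93)}

results = reprocess_segments(["header", "total"], FULL_PAGE.get, ZOOMED_IN.get)
print(results)
```

The header read well the first time and is left alone. Only the troublesome "total" line gets a second look.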
Level 3 Villains: Different Sized Fonts and Free-Floating Text
Cheat Code: Iterative OCR
Documents will frequently use different sized fonts and free-floating text that are not in alignment. OCR reads from left to right like we do. Therefore, if fonts on the left are of a different size or are out of alignment with text on the right side, typical OCR will generate poor results.
Grooper solves this problem with Iterative OCR, which reads the document multiple times. On the first pass, it reads everything. On each later pass, Grooper drops out the text it already read well and rereads only what it read poorly.
The dropped-out text will no longer interfere with what’s left, resulting in a much better read. This process continues to repeat until the villain has been overcome.
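The read, drop out, and reread loop described above can be sketched like this. The `read_item` callable is a hypothetical OCR call, and the fake engine below simply pretends that fine print reads poorly while a big heading is still on the page. The confidence cutoff and the pass limit are assumptions too.

```python
def iterative_ocr(page_items, read_item, min_confidence=0.8, max_passes=5):
    """Repeatedly read a page, dropping out well-read items between passes.

    `read_item(item, remaining)` is a hypothetical OCR call that returns
    (text, confidence). `remaining` is everything still on the page, so a
    fake engine can model interference from neighboring text.
    """
    remaining = list(page_items)
    accepted = {}
    for _ in range(max_passes):
        if not remaining:
            break
        snapshot = tuple(remaining)
        still_poor = []
        for item in remaining:
            text, confidence = read_item(item, snapshot)
            if confidence >= min_confidence:
                accepted[item] = text  # read well, so drop it out
            else:
                still_poor.append(item)
        if len(still_poor) == len(remaining):
            break  # nothing improved on this pass, so stop trying
        remaining = still_poor
    return accepted, remaining


def fake_read(item, remaining):
    # Toy model: the fine print reads poorly while the big heading is present.
    if item == "fine print" and "BIG HEADING" in remaining:
        return "f1ne pr!nt", 0.35
    return item, 0.95


accepted, unread = iterative_ocr(["BIG HEADING", "fine print"], fake_read)
print(accepted, unread)
```

On the first pass only the heading is accepted. With the heading dropped out, the second pass reads the fine print cleanly and nothing is left unread.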
Level 4 Villain: Multi-Column Layouts
Cheat Code: Cellular Validation
A page with two or more columns presents a challenge for typical OCR. Text in the columns may have different sizes, or the lines may be offset from one another. Many OCR processes will have a total breakdown with very inaccurate results.
However, Grooper uses Cellular Validation to create highly customizable OCR regions by splitting a document into specific rows and columns.
Rows and columns are defined by the Grooper user who understands the document’s structure and layout. Grooper then reads each of the individual rows and columns independently, understanding the difference in each section and how each section relates to the overall document.
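As a simple picture of splitting a page into user-defined rows and columns, here is a sketch that cuts a two-column page of text into a grid of cells. The `split_into_cells` function and the break positions are assumptions for illustration; in Grooper the user sets up the rows and columns inside the product rather than in code like this.

```python
def split_into_cells(page, row_breaks, col_breaks):
    """Cut a page (list of equal-length strings) into a grid of cells.

    `row_breaks` and `col_breaks` are user-defined boundaries, given as the
    indexes where a new row or column starts.
    """
    row_edges = [0, *row_breaks, len(page)]
    col_edges = [0, *col_breaks, len(page[0])]
    grid = []
    for top, bottom in zip(row_edges, row_edges[1:]):
        grid_row = []
        for left, right in zip(col_edges, col_edges[1:]):
            grid_row.append([line[left:right] for line in page[top:bottom]])
        grid.append(grid_row)
    return grid


# Two columns, each padded to a fixed width so the column break is at index 14.
page = [
    "Left column".ljust(14) + "Right column".ljust(12),
    "line one".ljust(14) + "first line".ljust(12),
    "line two".ljust(14) + "second line".ljust(12),
]
cells = split_into_cells(page, row_breaks=[1], col_breaks=[14])
right_body = [line.strip() for line in cells[1][1]]
print(right_body)
```

Each cell can then be handed to OCR on its own, so the left column never bleeds into the right one.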
Grooper’s Master Cheat Code: OCR Synthesis
Grooper combines the data produced by Bounded Region Detection, Segment Reprocessing, Iterative OCR, and Cellular Validation into one logical text flow. During OCR Synthesis, advanced font awareness re-analyzes spaces, tabs, and line feeds. Grooper ensures that all characters in a document are not just recognized but also assembled into logical groupings.
OCR Synthesis sets Grooper apart from traditional OCR systems by providing a much-improved foundation for accurate and reliable data capture.
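To give a feel for what "one logical text flow" means, here is a deliberately simplified sketch. It takes text fragments that came from different reads, each tagged with a position on the page, groups fragments that sit on roughly the same line, and joins them in reading order. The `synthesize` function, the (top, left, text) format, and the two-pixel line tolerance are all assumptions; Grooper’s OCR Synthesis, with its font awareness, does considerably more than sort by position.

```python
def synthesize(fragments, line_tolerance=2):
    """Merge (top, left, text) fragments from several passes into one flow."""
    lines = []  # each entry: [top_of_line, [(left, text), ...]]
    for top, left, text in sorted(fragments):
        if lines and abs(top - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append((left, text))  # same line as the previous one
        else:
            lines.append([top, [(left, text)]])  # start a new line
    return "\n".join(
        " ".join(text for _, text in sorted(parts)) for _, parts in lines
    )


fragments = [
    (11, 120, "$42.00"),     # e.g. read from inside a box
    (10, 10, "Total due:"),  # e.g. read during a full-page pass
    (0, 10, "INVOICE"),
]
flow = synthesize(fragments)
print(flow)
```

Even though the fragments arrive out of order and the amount sits a pixel lower than its label, they end up on the same line, in the order a person would read them.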
You’re About to Beat the Game!
Grooper has the instant-win button, the buzzer-beating 3-pointer, the ability to learn Kung Fu instantly by plugging into the Matrix, the... well, you get the point.
But here’s the rub. Few things in life are ever 100%, and OCR results are not among them.
Not even with Grooper’s awesome cheat codes can OCR recognize and accurately convert 100% of a document’s text. But with these tools, or cheat codes, we are far closer to correctly converting 100% of a document’s text than we were previously.
The final nail in the coffin of OCR problems is named Fuzzy Regular Expression. We’ll address that in an upcoming blog post.
Until then, I wish you the best in your document processing tasks and happy gaming!