Who said you can’t teach an old dog new tricks? We’ve brought dated Optical Character Recognition (OCR) technology into modern times with multi-pass OCR.
Near-Perfect OCR Shouldn’t Be So Difficult
The secret to high accuracy is removing all non-text elements.
Grooper does this through industry-first image processing tools and out-of-the-box configurations specifically designed for the task.
The best part is that these tools won’t alter the original image, which you retain permanently.
What’s the best OCR software? The best is actually a combination. Use your own, or Grooper’s (or both).
How to Prep a Document Image for Optical Character Recognition
- Remove lines
- Clean up document edges
- Remove small specks
- Remove large non-text objects
- Invert white-on-black zones
- Remove hole punches
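Two of the steps above, speck removal and zone inversion, can be sketched on a tiny black-and-white image. This is a minimal illustration rather than Grooper's implementation: the image is a list of rows where 1 is ink and 0 is background, and the function names and the `min_pixels` threshold are assumptions made for the example. Both functions return a cleaned copy and leave the input untouched.

```python
from collections import deque

def remove_small_specks(image, min_pixels=3):
    """Return a copy with connected ink groups smaller than min_pixels erased."""
    rows, cols = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill to collect one 4-connected group of ink pixels.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < min_pixels:
                for y, x in component:
                    cleaned[y][x] = 0
    return cleaned

def invert_zone(image, top, left, bottom, right):
    """Return a copy with ink and background flipped inside a rectangle.

    The bottom and right bounds are exclusive.
    """
    inverted = [row[:] for row in image]
    for y in range(top, bottom):
        for x in range(left, right):
            inverted[y][x] = 1 - image[y][x]
    return inverted

page = [
    [1, 0, 0, 0, 0],  # the lone pixel at the left is a speck
    [0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1],
]
despeckled = remove_small_specks(page)
```

Real cleanup works on full-resolution scans and usually relies on an image processing library, but the idea is the same: find the non-text marks, then erase them from a working copy.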
Tips for More Accurate OCR
No matter how clean and pristine document images appear, text recognition software still struggles to collect accurate text. Text in images, in multiple columns, and different font sizes all contribute to bad character recognition.
Another cause of inaccurate capture is that OCR engines process whole pages, from top to bottom. A better approach is to OCR select areas of a page and combine the results.
Grooper’s patented OCR synthesis engine intelligently performs multiple passes on different portions of a document image. Results are grouped together as a single unit, providing highly accurate text results.
In a lab test, Grooper accurately captured 99.91% of text. Using OCR alone on the same data set proved half as accurate.
5 Features to Guarantee Accurate OCR
#1: Iterative OCR – Capture Missed Text
Iterative OCR captures text missed on the first OCR pass.
OCR is run multiple times, and on each subsequent pass, recognized text is dropped out of the document image. The OCR engine identifies remaining text.
Because each new pass has fewer distractions, finding missed text is easier.
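The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not Grooper's engine: the page is a dictionary of text regions, `iterative_ocr` and `toy_engine` are hypothetical names, and the toy engine stands in for a real OCR engine by reading only the largest text still on the page.

```python
def iterative_ocr(page, ocr_engine, max_passes=5):
    """Run the engine repeatedly, dropping recognized regions after each pass."""
    remaining = dict(page)
    results = {}
    for _ in range(max_passes):
        recognized = ocr_engine(remaining)
        if not recognized:
            break  # nothing new was found, so further passes will not help
        results.update(recognized)
        for region_id in recognized:
            remaining.pop(region_id, None)  # drop recognized text from the page
    return results

def toy_engine(regions):
    """Stand-in engine: it only reads the largest text currently on the page."""
    if not regions:
        return {}
    largest = max(size for size, _ in regions.values())
    return {rid: text for rid, (size, text) in regions.items() if size == largest}

# Each region maps to (font size, text). The values are made up for the example.
page = {
    "title": (24, "INVOICE"),
    "body": (11, "Total due: 42.00"),
    "footnote": (7, "Net 30"),
}
single_pass = toy_engine(page)
multi_pass = iterative_ocr(page, toy_engine)
```

In this toy setup a single pass returns only the title, while the iterative loop also recovers the body and the footnote once the larger text has been dropped.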
#2: Cellular Validation – Capture Columns of Text
Multi-column layouts present a unique challenge for optical character recognition, especially when columns of text are offset or use different fonts or font sizes. These are typically documents like invoices and purchase statements.
Standard character recognition processes will fail on at least one of the columns of text. Because Cellular Validation splits an image into a grid, each area is processed independently.
The result: industry-leading accuracy reading and processing documents.
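As a rough sketch of the grid idea, the example below splits a page into cells and runs OCR on each cell independently. It is not Grooper's implementation: rows of characters stand in for rows of pixels, and `split_into_cells`, `ocr_by_cell`, and `toy_engine` are hypothetical names chosen for the illustration.

```python
def split_into_cells(image, n_rows, n_cols):
    """Cut a 2D image into an n_rows x n_cols grid of sub-images."""
    height, width = len(image), len(image[0])
    cells = {}
    for i in range(n_rows):
        top, bottom = i * height // n_rows, (i + 1) * height // n_rows
        for j in range(n_cols):
            left, right = j * width // n_cols, (j + 1) * width // n_cols
            cells[(i, j)] = [row[left:right] for row in image[top:bottom]]
    return cells

def ocr_by_cell(image, n_rows, n_cols, ocr_engine):
    """Run the engine on every cell independently, keyed by grid position."""
    cells = split_into_cells(image, n_rows, n_cols)
    return {position: ocr_engine(cell) for position, cell in cells.items()}

def toy_engine(cell):
    """Stand-in engine: the 'pixels' are already characters, so just tidy them."""
    return " ".join(line.strip() for line in cell)

# Two rows of a two-column layout, 12 characters wide.
page = [
    "Widget  9.99",
    "Gadget 12.50",
]
cell_text = ocr_by_cell(page, n_rows=2, n_cols=2, ocr_engine=toy_engine)
```

Keeping the grid position as the key means the per-cell results can later be reassembled in reading order.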
#3: Bound Region Detection – Capture Text in Boxes
A bound region is a page section which is bound on all sides by lines. Bound Region Detection changes the order of OCR processing by starting with text inside of boxes. This ensures content outside of text boxes doesn’t cause confusion reading content inside text boxes.
Extracted text from each box is removed from the document image before performing full-page OCR. Because the location of the text boxes is understood, all text is intelligently joined back together.
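The processing order described above (boxes first, then the rest of the page, then a join by location) might look like the following. This is a hedged sketch: `Region`, `ocr_boxes_first`, and `toy_engine` are hypothetical names, and the toy engine simply pretends that boxed text is unreadable whenever other regions are on the page with it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    top: int
    left: int
    text: str
    boxed: bool = False

def join_in_reading_order(found):
    """Join (top, left, text) results top to bottom, then left to right."""
    ordered = sorted(found, key=lambda item: (item[0], item[1]))
    return " ".join(text for _, _, text in ordered)

def ocr_boxes_first(regions, ocr_engine):
    boxed = [r for r in regions if r.boxed]
    rest = [r for r in regions if not r.boxed]
    found = []
    for box in boxed:
        found.extend(ocr_engine([box]))  # each box is read in isolation
    found.extend(ocr_engine(rest))       # full-page pass with boxes removed
    return join_in_reading_order(found)

def toy_engine(regions):
    """Stand-in engine: boxed text is garbled when other regions are present."""
    crowded = len(regions) > 1
    return [(r.top, r.left, "???" if (r.boxed and crowded) else r.text)
            for r in regions]

regions = [
    Region(0, 0, "Invoice"),
    Region(1, 0, "No. 1001", boxed=True),
    Region(2, 0, "Thank you"),
]
full_page_only = join_in_reading_order(toy_engine(regions))
boxes_first = ocr_boxes_first(regions, toy_engine)
```

Because every result carries its position, the boxed text lands back in the right place after the two passes are merged.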
#5: Layered OCR – Capture Multiple Fonts and Handwriting
A document with multiple fonts makes accurate character recognition tricky. Overcome this by using a different OCR engine for each document layer and combining the results.
One example of a document with mixed print types is a check, which includes standard text fonts, OCR-A, OCR-B, MICR fonts, and handwriting. Because some engines read certain fonts better than others, it makes sense to use the right tool for the job.
Another use case for this feature is label repair. Repair lines of text, and eliminate inaccuracies in field labels for greater accuracy and simpler data extraction.
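One way to picture combining several engines is to keep, for each region, the reading from whichever engine is most confident. The sketch below assumes each engine reports a confidence score alongside its text. That interface, the `layered_ocr` and `make_engine` names, and the sample check data are all assumptions for illustration, not Grooper's API.

```python
def layered_ocr(page, engines):
    """Run every engine and keep the highest-confidence reading per region."""
    best = {}
    for engine in engines:
        for region_id, (text, confidence) in engine(page).items():
            if region_id not in best or confidence > best[region_id][1]:
                best[region_id] = (text, confidence)
    return {region_id: text for region_id, (text, _) in best.items()}

def make_engine(specialty):
    """Build a stand-in engine that reads one print type well and garbles the rest."""
    def engine(page):
        readings = {}
        for region_id, (kind, text) in page.items():
            if kind == specialty:
                readings[region_id] = (text, 0.95)
            else:
                readings[region_id] = ("?" * len(text), 0.30)
        return readings
    return engine

# Each region maps to (print type, true text). The values are made up.
check = {
    "payee": ("print", "Pay to: ACME"),
    "routing": ("micr", "123456789"),
}
merged = layered_ocr(check, [make_engine("print"), make_engine("micr")])
```

Neither stand-in engine reads the whole check correctly on its own, but the merged result takes the payee line from the print engine and the routing line from the MICR engine.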
Intelligent Spell Correction
Powered by Atomic RegEx, Grooper performs corrections to fix some pretty ugly stuff. And the secret to making this work? K-Means Clustering, text removal, and text correction engines.
What Spelling Errors Does Grooper Correct?
- Simple OCR mistakes in strings that don’t match words in a standard dictionary
- Human-generated typos on documents
- Missing spaces where OCR falsely jammed multiple words together
- Strings of non-alphanumeric characters that resemble an attempt at censorship, like “$#@! ^&*”, which are deleted
- Numeric values where overly aggressive image cleanup inadvertently removed punctuation
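Two of the corrections in this list can be sketched with the Python standard library. This is a generic illustration rather than Grooper's method: it uses a plain dictionary lookup with dynamic programming to split jammed words and a regular expression to drop symbol runs. The function names, the tiny vocabulary, and the three-symbol threshold are assumptions for the example.

```python
import re

def split_jammed_words(token, vocabulary):
    """Split a token into the fewest known words, or return it unchanged."""
    n = len(token)
    # best[i] is the shortest word list covering token[:i], or None if none exists.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece.lower() in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return " ".join(best[n]) if best[n] else token

# Runs of three or more symbols, optionally chained by spaces.
SYMBOL_RUN = re.compile(r"(?:[^\w\s]{3,}\s*)+")

def drop_symbol_runs(text):
    """Delete runs of symbols and collapse the whitespace left behind."""
    cleaned = SYMBOL_RUN.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

vocabulary = {"total", "amount", "due", "a", "mount"}
fixed = split_jammed_words("totalamountdue", vocabulary)
tidy = drop_symbol_runs("Total $#@! ^&* due")
```

Preferring the split with the fewest words keeps "amount" from being broken into "a mount". The symbol pattern is deliberately blunt and would also remove a legitimate ellipsis, so a production rule set would need exceptions.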
How to OCR a PDF Document
PDF is the most widely used document standard in the world. Because there’s no standard for generating a PDF, capturing text has varying levels of difficulty:
- Some PDFs are purely text-based (easy to OCR)
- Others are just document scans in PDF format (difficult)
- Other PDFs have combinations of the two scattered throughout pages (most difficult)
PDF documents present a fair number of text recognition challenges.
How to Get Text off of PDFs:
Grooper looks at each page within a PDF and places the page into one of three categories: image-based, text-based, or mixed-content. By doing this automatically, specific rules and processing techniques make text extraction easier.
Each page is handled accordingly:
- Process PDF pages containing a single image covering the entire page as image-based pages
- If a PDF page contains no images, extract only the raw text behind the page
- For mixed-content pages, extract each image to a temporary image, process the image, and merge the results with the native text
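The three rules above translate into a small decision function. The sketch below is a simplified model, not Grooper's code and not a real PDF library: `PdfPage`, `PdfImage`, `classify_page`, and `extract_text` are hypothetical names, and the `content` field is a placeholder for image data that a real OCR engine would read.

```python
from dataclasses import dataclass, field

@dataclass
class PdfImage:
    covers_full_page: bool
    content: str  # placeholder for pixel data; the toy engine "reads" it

@dataclass
class PdfPage:
    native_text: str = ""
    images: list = field(default_factory=list)

def classify_page(page):
    """Place a page into one of the three categories."""
    if not page.images:
        return "text-based"
    if len(page.images) == 1 and page.images[0].covers_full_page:
        return "image-based"
    return "mixed-content"

def extract_text(page, ocr_engine):
    """Pick an extraction strategy based on the page category."""
    kind = classify_page(page)
    if kind == "text-based":
        return page.native_text  # no OCR needed
    ocr_text = [ocr_engine(image) for image in page.images]
    if kind == "image-based":
        return ocr_text[0]
    # Mixed content: merge the native text with the OCR result of each image.
    return "\n".join(part for part in [page.native_text, *ocr_text] if part)

def toy_ocr(image):
    """Stand-in engine that returns the placeholder content."""
    return image.content

text_page = PdfPage(native_text="Hello")
scan_page = PdfPage(images=[PdfImage(True, "scanned words")])
mixed_page = PdfPage(native_text="Caption", images=[PdfImage(False, "chart label")])
kinds = [classify_page(p) for p in (text_page, scan_page, mixed_page)]
```

Classifying each page before extraction means OCR runs only where it is needed, and text-based pages keep their native text exactly as stored in the file.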
Grooper’s “Run Speed” option provides control to achieve the ideal balance between accuracy and performance.
Grooper recognizes 268 distinct languages and 523 regional cultures. Language detection interprets dates, times, currency names, numeric formats, and more.
Grooper avoids optical character recognition altogether when dealing with original text-based files like Word, Excel, and Text PDFs. Instead, Grooper pulls complete and perfect text directly from the file.