Who said you can’t teach an old dog new tricks? We’ve brought dated Optical Character Recognition (OCR) technology into modern times with multi-pass OCR software.
Perfect OCR Isn’t so Difficult
Want to know the secret? Remove everything that isn’t text – it makes the OCR engine’s life so much easier.
Grooper does this through industry-first image processing tools and out-of-the-box configurations designed for the task.
The best part is that these tools won’t alter the original version of the image that you want to permanently retain. Whether you have paper or electronic documents, Grooper provides the best results in image processing and capturing text.
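As a generic illustration of removing non-text marks without touching the original (a minimal sketch, not Grooper’s actual tooling), the Python below erases long horizontal rules from a copy of a black-and-white image. The function name `remove_horizontal_lines`, the `min_run` threshold, and the 0/1 pixel representation are all assumptions made for this example.

```python
from typing import List

Image = List[List[int]]  # rows of pixels: 0 = white, 1 = ink (assumed format)


def remove_horizontal_lines(image: Image, min_run: int = 20) -> Image:
    """Return a cleaned copy with long horizontal ink runs erased.

    Short runs (likely character strokes) are kept; the input image is
    never modified, so the original can be retained as-is.
    """
    cleaned = [row[:] for row in image]  # work on a copy only
    for row in cleaned:
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                # A run this long is treated as a ruled line, not text.
                if x - start >= min_run:
                    row[start:x] = [0] * (x - start)
            else:
                x += 1
    return cleaned
```

The threshold is the main design choice: set it too low and wide characters or underlined text get erased, too high and form lines survive.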
What’s the best OCR software? We get asked that all the time. OCR alone is far too inaccurate. The answer is combining Grooper’s image processing and recognition tools with one of many off-the-shelf engines, such as Transym 4/5, Tesseract, Azure, ABBYY, or Prime.
(Not sure what Grooper is? No problem. Learn more about Grooper data integration.)
How Grooper Prepares a Document Image for OCR:
Tips for Accurate OCR
No matter how clean and pristine document images appear, text recognition software still struggles to collect accurate text. Text in images, in multiple columns, and in different font sizes all contribute to bad character recognition.
Another cause of inaccurate capture is that recognition engines process pages from top to bottom. (Hint: A better approach is OCRing select areas of a page and combining the results together.)
Grooper’s patented OCR synthesis engine intelligently performs multiple passes on different portions of a document image. Results are grouped together as a single unit, providing highly accurate text results.
In a lab test, Grooper accurately captured 99.91% of text. Using OCR alone on the same data set proved half as accurate.
Synthetic OCR – 5 Features that Guarantee Accurate OCR
#1: Iterative OCR – Capture Missed Text
Iterative OCR captures text missed on the first pass.
In order to get the rest of the text, Grooper runs OCR multiple times. On each pass, recognized text is dropped out of the document image. The recognition engine finds any remaining text.
Because each new pass has fewer distractions, finding missed text is easier.
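The loop described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Grooper’s implementation: `iterative_ocr`, `drop_out`, and `toy_engine` are hypothetical names, the image is a grid of 0/1 pixels, and the stand-in engine deliberately reads only one region per pass so the effect of repeated passes is visible.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
Word = Tuple[str, Box]            # recognized text plus where it was found
Image = List[List[int]]           # rows of pixels: 0 = white, 1 = ink


def drop_out(image: Image, box: Box) -> None:
    """Erase a recognized region in place so later passes ignore it."""
    left, top, right, bottom = box
    for y in range(top, bottom):
        for x in range(left, right):
            image[y][x] = 0


def iterative_ocr(image: Image,
                  engine: Callable[[Image], List[Word]],
                  max_passes: int = 3) -> List[Word]:
    """Run the engine repeatedly, dropping out what each pass recognized."""
    work = [row[:] for row in image]  # working copy; the original is untouched
    found: List[Word] = []
    for _ in range(max_passes):
        words = engine(work)
        if not words:                 # nothing left to find
            break
        for word in words:
            found.append(word)
            drop_out(work, word[1])
    return found


def toy_engine(layout: List[Word]) -> Callable[[Image], List[Word]]:
    """Hypothetical stand-in engine: reads only the first inked region it sees."""
    def engine(image: Image) -> List[Word]:
        for text, (left, top, right, bottom) in layout:
            if any(image[y][x] for y in range(top, bottom)
                   for x in range(left, right)):
                return [(text, (left, top, right, bottom))]
        return []
    return engine


page = [[1, 1, 0, 1, 1]]
layout = [("INVOICE", (0, 0, 2, 1)), ("42", (3, 0, 5, 1))]
words = iterative_ocr(page, toy_engine(layout))
```

A single call to the toy engine would return only "INVOICE"; the second pass picks up "42" because the first word has been dropped out of the working copy.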
#2: Cellular Validation – Capture Columns of Text
Multi-column layouts present a unique challenge for OCR, especially when columns of text are offset or use different fonts or font sizes. Invoices and purchase statements are typical examples.
Standard OCR software will fail on at least one of the columns of text. Because Cellular Validation splits an image into a grid, each area is processed independently.
The result: industry-leading accuracy for reading and processing documents.
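To make the grid idea concrete, here is a small Python sketch. It assumes a toy page made of text lines standing in for pixels, and the names `grid_cells`, `ocr_by_cell`, and `toy_engine` are invented for the example; how Grooper actually chooses and validates cells is not described here.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def grid_cells(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Return boxes that tile a width x height page in a rows x cols grid."""
    cells = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            cells.append((left, top, right, bottom))
    return cells


def ocr_by_cell(page: List[str],
                engine: Callable[[List[str]], str],
                rows: int, cols: int) -> List[Tuple[Box, str]]:
    """Crop each grid cell and hand it to the engine on its own."""
    height, width = len(page), len(page[0])
    results = []
    for box in grid_cells(width, height, rows, cols):
        left, top, right, bottom = box
        crop = [line[left:right] for line in page[top:bottom]]
        results.append((box, engine(crop)))
    return results


def toy_engine(crop: List[str]) -> str:
    """Hypothetical engine: reads a crop line by line, top to bottom."""
    return " ".join(line.strip() for line in crop if line.strip())


page = ["ab  cd",
        "ef  gh"]
columns = ocr_by_cell(page, toy_engine, rows=1, cols=2)
```

Read as one block, the toy page comes out as "ab  cd ef  gh", mixing the two columns. Read cell by cell, each column stays together.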
#3: Bound Region Detection – Capture Text in Boxes
A bound region is a page section that is bounded on all sides by lines. Grooper’s Bound Region Detection changes the order of character processing by starting with the text inside boxes.
This ensures that content outside of text boxes doesn’t cause confusion when reading content inside text boxes.
Extracted text from each box is removed from the document image before performing full-page OCR. Because the location of the text boxes is understood, all text is intelligently joined back together.
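The box-first order can be sketched as follows. This is a toy model, not Grooper’s engine: the page is a list of text lines, box locations are assumed to be already detected and passed in, and `find_words` and `ocr_boxes_first` are hypothetical names.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in character cells
Word = Tuple[int, int, str]       # (row, column, text)


def find_words(lines: List[str], dx: int = 0, dy: int = 0) -> List[Word]:
    """Toy 'engine': report each run of non-space characters with its position."""
    return [(dy + y, dx + m.start(), m.group())
            for y, line in enumerate(lines)
            for m in re.finditer(r"\S+", line)]


def ocr_boxes_first(page: List[str], boxes: List[Box]) -> List[Word]:
    """Read boxed regions first, blank them out, then read the rest."""
    work = [list(line) for line in page]   # working copy; the page is untouched
    words: List[Word] = []
    for left, top, right, bottom in boxes:
        crop = ["".join(row[left:right]) for row in work[top:bottom]]
        # Shift positions back into page coordinates so results can be merged.
        words += find_words(crop, dx=left, dy=top)
        for row in work[top:bottom]:
            row[left:right] = [" "] * (right - left)
    words += find_words(["".join(row) for row in work])
    # Known positions let box text and page text be joined in reading order.
    return sorted(words)


page = ["Ship to   ACME",
        "Total     42  "]
merged = ocr_boxes_first(page, boxes=[(10, 0, 14, 1)])
```

Although "ACME" is read first, sorting by position puts it back between "to" and "Total" in the merged result.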
Intelligent Spell Correction
Powered by Atomic RegEx, Grooper performs corrections to fix some pretty ugly stuff. And what is the secret to making this work? A few tools, like K-Means Clustering, text removal, and text correction engines.
What Spelling Errors Does Grooper Correct?
- Simple capture mistakes in strings that don’t match words in a standard dictionary
- Human-generated typos on documents
- Word splitting – inserts spaces where OCR falsely jammed multiple words together
- Junk strings – deletes runs of characters that are not numbers or letters, such as “$#@! ^&*”, which resembles an attempt at censorship
- Number repair – restores punctuation in numbers, such as prices, where overly aggressive image cleanup mistakenly removed it
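Three of the corrections listed above can be approximated with plain Python and regular expressions. The sketch below is only an illustration: the function names, the tiny vocabulary, the "three or more symbols" rule, and the "$12 50" price pattern are all assumptions, and none of this reflects how Atomic RegEx or Grooper’s correction engines are built.

```python
import re
from typing import Dict, List, Set


def split_jammed(token: str, vocabulary: Set[str]) -> str:
    """Insert spaces if the token is a run of dictionary words; else leave it."""
    lower = token.lower()
    # best[i] holds a list of words that exactly covers token[:i].
    best: Dict[int, List[str]] = {0: []}
    for end in range(1, len(lower) + 1):
        for start in range(end):
            if start in best and lower[start:end] in vocabulary:
                best[end] = best[start] + [token[start:end]]
                break
    return " ".join(best.get(len(lower), [token]))


def drop_symbol_runs(text: str) -> str:
    """Delete stand-alone runs of three or more non-alphanumeric symbols."""
    without = re.sub(r"(?<!\S)[^\w\s]{3,}(?!\S)", " ", text)
    return " ".join(without.split())


def repair_price(text: str) -> str:
    """Restore a missing decimal point in prices such as '$12 50' (assumed rule)."""
    return re.sub(r"\$(\d+) (\d{2})\b", r"$\1.\2", text)
```

For example, `split_jammed("InvoiceNumber", {"invoice", "number"})` yields "Invoice Number", while a token that cannot be fully covered by dictionary words is returned unchanged.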
How to OCR a PDF Document
PDF is the most widely used document standard in the world. Because there’s no single way to generate a PDF, the difficulty of capturing text varies:
- Some PDFs are purely text-based (easy to capture from)
- Others are just document scans in PDF format (difficult)
- Other PDFs have combinations of the two scattered throughout pages (most difficult)
PDF documents have a fair amount of text capture challenges.
How to Get Text off of PDFs:
Grooper looks at each page within a PDF and places the page into one of three categories: image-based, text-based, or mixed-content.
Because this sorting happens automatically, Grooper can apply rules and processing methods specific to each category, which makes text extraction easier.
Then, each page is handled accordingly:
- Process PDF pages that have a single image covering the entire page as image-based pages
- For pages that contain no images, extract only the raw text behind the page
- For mixed-content pages, extract each image to a temporary image, process the image, and merge the results with the native text
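The three-way routing above could look something like this in Python. It is a sketch under assumptions: `Page`, `PageImage`, `classify`, and `extract_text` are hypothetical names, a page is reduced to its native text plus a list of images with a coverage fraction, and the 0.99 cutoff for "covers the entire page" is an arbitrary choice for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PageImage:
    coverage: float  # fraction of the page area the image covers, 0.0 to 1.0


@dataclass
class Page:
    native_text: str = ""
    images: List[PageImage] = field(default_factory=list)


def classify(page: Page, full_page: float = 0.99) -> str:
    """Sort a page into one of the three categories."""
    if not page.images:
        return "text-based"
    if len(page.images) == 1 and page.images[0].coverage >= full_page:
        return "image-based"
    return "mixed-content"


def extract_text(page: Page, ocr: Callable[[PageImage], str]) -> str:
    """Handle each page according to its category."""
    kind = classify(page)
    if kind == "text-based":
        return page.native_text                  # no OCR needed
    ocr_text = [ocr(image) for image in page.images]
    if kind == "image-based":
        return "\n".join(ocr_text)               # OCR the full-page scan
    # Mixed content: merge OCR results with the native text.
    return "\n".join([page.native_text] + ocr_text)
```

Passing the OCR step in as a callable keeps the routing logic independent of whichever recognition engine is used.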
Grooper OCR is trainable. The engine supports training custom and difficult font formats.
Grooper’s “Run Speed” option provides control to achieve the ideal balance between accuracy and performance.
Grooper recognizes 268 distinct languages and 523 regional cultures. Language detection interprets dates, times, currency names, numeric formats, and more.
Grooper avoids OCR altogether when dealing with original text-based files like Word, Excel, and text-based PDFs. Instead, Grooper pulls complete and perfect text directly from the file.
Supercharge Your OCR – Save Time & End Manual Data Entry
Getting accurate capture results from old or poor-quality scanned documents used to be almost impossible, and it was especially tough if you needed to save a human-readable copy. With Grooper, you get high accuracy and a great-looking document image.
Save thousands of hours of work and get far better data!