Who said you can’t teach an old dog new tricks? We’ve taken dated Optical Character Recognition (OCR) technology and brought it into modern times.
Grooper’s patented Synthetic OCR generates the most accurate text from images and electronic files, regardless of which OCR engine you use.
Great OCR Starts with Image Quality
Focusing on image quality first ensures future success.
Before any OCR action takes place, make sure you’re handing the engine an image that’s free of artifacts (non-text elements). How do you better recognize text? First, remove everything from the page that isn’t text.
Grooper does this through industry-first image processing tools and out-of-the-box configurations specifically designed for this task. (To see these tools, check out our image processing.)
The best part is these tools won’t alter the original version of the image you want to permanently retain.
What Does Grooper Do Before Optical Character Recognition?
- Removes lines
- Ensures edges are clean
- Removes small specks
- Removes large non-text objects
- Inverts white-on-black zones
- Removes hole punches
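To make one of these cleanup steps concrete, here is a minimal sketch of speck removal. It is not Grooper's implementation. It assumes the page is already a binary grid (1 = ink, 0 = background), and the function name `remove_small_specks` and the `min_pixels` threshold are hypothetical choices for illustration.

```python
from collections import deque


def remove_small_specks(grid, min_pixels=3):
    """Return a cleaned copy of a binary grid (1 = ink, 0 = background).

    Any 4-connected blob of ink with fewer than min_pixels pixels is erased.
    The input grid is left untouched, mirroring the idea of keeping the
    original image intact.
    """
    height = len(grid)
    width = len(grid[0]) if grid else 0
    cleaned = [row[:] for row in grid]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected blob of ink.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    in_bounds = 0 <= ny < height and 0 <= nx < width
                    if in_bounds and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Blobs below the threshold are treated as specks and erased.
            if len(blob) < min_pixels:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


# A tiny made-up page: two isolated specks and one 2x2 block of "text".
page = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
cleaned = remove_small_specks(page, min_pixels=3)
```

In this toy example the two single-pixel specks are erased from the copy while the 2x2 block survives, and `page` itself is unchanged.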
How to Get More Accurate OCR
No matter how clean and pristine your images may appear, many recognition engines still have a hard time collecting accurate text. Text embedded in images, multiple columns, and mixed font sizes all contribute to poor character recognition.
Another common problem that results in inaccurate capture is that nearly all engines process an entire page at once. Grooper solves this problem by focusing on select areas of a page and then synthesizing the results together.
Grooper’s patented OCR synthesis engine intelligently performs multiple passes on different portions of a document image. It then groups the results together as a single unit, keeping only the most accurate text results.
In a lab test, Grooper accurately captured 99.91% of text. Using only OCR on the same data set proved half as accurate.
5 Techniques That Get More Accurate OCR
#1: Capture Mixed Text with Iterative OCR
Iterative OCR is a technique we’ve developed as a way to capture text that the OCR engines simply miss the first time around.
To overcome this, we run a pass of OCR on the entire document, drop out any portions of the page where we were able to obtain text, then run another iteration on the new image.
With far fewer distractions in the new image, the engine can more clearly find text it missed during previous passes.
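The loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Grooper's code: the page is modeled as a dictionary of regions, `engine` is any callable that returns the regions it managed to read, and `toy_engine` is a made-up stand-in that only reads the largest text it can see.

```python
def iterative_ocr(page, engine, max_passes=5):
    """Run an engine repeatedly, dropping out whatever was read each time.

    page:   dict mapping a region id to whatever the engine needs to read it.
    engine: callable taking such a dict and returning {region id: text}
            for the regions it could read (hypothetical interface).
    """
    remaining = dict(page)  # work on a copy; the original page is retained
    results = {}
    for _ in range(max_passes):
        found = engine(remaining)
        if not found:
            break  # nothing new was read, so further passes will not help
        results.update(found)
        for region in found:
            remaining.pop(region, None)  # drop out the text already captured
        if not remaining:
            break
    return results


def toy_engine(regions):
    """Stand-in engine: it only reads the largest font size present."""
    if not regions:
        return {}
    dominant = max(size for size, _ in regions.values())
    return {rid: text for rid, (size, text) in regions.items() if size == dominant}


# Made-up page: region id -> (font size, text)
page = {
    "title": (24, "INVOICE"),
    "body": (11, "Total due: $40"),
    "footnote": (7, "Terms apply"),
}
single_pass = toy_engine(page)
all_passes = iterative_ocr(page, toy_engine)
```

With this toy engine, a single pass captures only the title, while the iterative loop picks up the body and footnote on later passes once the larger text has been dropped out.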
#2: Capture Columns of Text with Cellular Validation
Multi-column layouts present a unique challenge for optical character recognition. Text on each side of a document may have different font sizes or the lines of text may be slightly offset from each other.
A standard character recognition process will have a complete breakdown of accuracy in one of the two sides. However, Grooper’s Cellular Validation splits the image into a grid of multiple areas and processes them one by one.
The result: industry-leading accuracy when it comes to reading and processing your documents.
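As a rough illustration of the grid idea, the sketch below computes the cell rectangles for a page and hands each one to an engine in turn. The function names, the `(left, top, right, bottom)` box convention, and the engine interface are assumptions made for this example, not Grooper's actual API.

```python
def grid_cells(width, height, rows, cols):
    """Split a width x height page into rows x cols rectangles.

    Returns boxes as (left, top, right, bottom) in pixels, listed row by row.
    Integer division keeps the cells contiguous with no gaps or overlaps.
    """
    cells = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            cells.append((left, top, right, bottom))
    return cells


def ocr_by_cell(width, height, rows, cols, engine):
    """Process each cell on its own and return (box, text) pairs.

    engine is a hypothetical callable that takes a box and returns its text.
    """
    return [(box, engine(box)) for box in grid_cells(width, height, rows, cols)]


# Example: a 850 x 1100 pixel page split into a 2 x 2 grid.
cells = grid_cells(850, 1100, rows=2, cols=2)
```

Because each cell is processed separately, a font size or line offset on one side of the page cannot influence how the other side is read.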
#3: Capture Text in Boxes with Bound Region Detection
This method changes the OCR process so that the information in boxes is processed before the full page. This ensures that other content on the page won’t interfere with what’s in boxes.
OCR is first performed on each box. Then the content is removed from the image before full-page OCR. This also improves overall page accuracy.
#5: Accurately Capture Different Fonts (and Handwriting) with Layered OCR
Different fonts on one document cause problems for many character recognition engines. To overcome this, Grooper uses an approach that combines the output of many engines on a single document.
A main OCR profile is run for one font followed by profiles for other fonts to ensure the desired data is extracted properly.
One example of a document with mixed print types is a check, which includes standard text fonts, OCR-A, OCR-B, MICR fonts, and handwriting. No single character recognition engine reads all these consistently. But thanks to our layered approach, this is not a problem for Grooper.
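One way to picture the layered approach is sketched below. It assumes each profile reports a confidence score alongside its text and that the highest-confidence reading wins for each zone. Both assumptions, along with the profile functions and their made-up readings, are for illustration only and are not confirmed details of Grooper.

```python
def layered_ocr(zones, profiles):
    """Run every profile over every zone and keep the most confident reading.

    zones:    list of zone names on the document.
    profiles: list of callables, main profile first; each takes a zone name
              and returns (text, confidence). Hypothetical interface.
    """
    best = {}
    for profile in profiles:
        for zone in zones:
            text, confidence = profile(zone)
            if zone not in best or confidence > best[zone][1]:
                best[zone] = (text, confidence)
    return {zone: text for zone, (text, _) in best.items()}


def print_profile(zone):
    """Stand-in for a profile tuned to standard printed fonts."""
    readings = {"payee": ("Pay to the order of", 0.97), "micr": ("?:1234?", 0.30)}
    return readings.get(zone, ("", 0.0))


def micr_profile(zone):
    """Stand-in for a profile tuned to MICR characters."""
    readings = {"payee": ("P4y t0", 0.20), "micr": ("123456789", 0.97)}
    return readings.get(zone, ("", 0.0))


check_text = layered_ocr(["payee", "micr"], [print_profile, micr_profile])
```

Here the printed-font profile supplies the payee line and the MICR profile supplies the MICR line, so neither profile's weak reading ends up in the result.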
Intelligent Spell Correction During OCR
Powered by our Atomic RegEx engine, Grooper performs corrections to fix some pretty ugly stuff. What’s our secret? Tools such as K-Means Clustering, text removal, and text correction engines.
What Spelling Errors Does Grooper Correct?
- Simple OCR mistakes in strings that don’t match words in a language of your choice.
- Existing, human-generated typos on documents.
- Missing spaces where OCR falsely jammed multiple words together.
- Strings of non-alphanumeric characters that resemble somebody’s attempt at censorship, like “$#@! ^&*”.
- Numeric values where overly aggressive image cleanup has inadvertently removed commas and periods.
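Two of these corrections can be sketched with nothing more than a word list and Python's standard `difflib` module. This is a simplified illustration, not the Atomic RegEx engine: the tiny `WORDS` lexicon, the function names, and the 0.8 similarity cutoff are all assumptions.

```python
import difflib

# Hypothetical lexicon; a real one would come from the chosen language.
WORDS = {"total", "amount", "due", "invoice", "date"}


def split_joined(token, words=WORDS):
    """Split a token into known words, or return None if it cannot be done.

    best[i] holds one valid split of token[:i], built up left to right.
    """
    n = len(token)
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is not None and token[start:end] in words:
                best[end] = best[start] + [token[start:end]]
                break
    return best[n]


def correct_token(token, words=WORDS):
    """Return the token unchanged, re-spaced, or replaced by a close match."""
    lower = token.lower()  # matching is done case-insensitively
    if lower in words:
        return token  # already a known word
    parts = split_joined(lower, words)
    if parts:
        return " ".join(parts)  # re-insert the missing spaces
    close = difflib.get_close_matches(lower, sorted(words), n=1, cutoff=0.8)
    return close[0] if close else token  # leave unknown strings alone


respaced = correct_token("totalamountdue")
repaired = correct_token("invo1ce")
```

In this sketch `respaced` becomes "total amount due" and `repaired` becomes "invoice", while a string with no close match in the lexicon is returned as-is rather than guessed at.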
How to OCR a PDF Document
PDF has become the most widely used document standard in the world. But PDFs can be constructed in different ways which make capture difficult in some cases:
- Some PDFs are purely text-based (easy to OCR)
- Others are just images re-packaged into a PDF format (difficult)
- Other PDFs have combinations of the two scattered throughout pages (most difficult)
As a result, PDFs present different challenges you’ll have to face in order to get the best text from every page. Thankfully, Grooper has tools to get around these obstacles…
How to Get Text off of PDFs:
Grooper looks at each page within a PDF to place the page into one of three categories: image-based, text-based, or mixed-content.
Each page is handled accordingly:
- If a PDF page contains a single image which covers the entire page, it is considered an image-based page, and is processed using the best OCR software.
- If a PDF page contains no images, we extract only the raw text behind the page.
- For mixed-content pages, each image on the page is extracted to a temporary image. Each temporary image is processed. Then the results are intelligently merged with the native text.
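The three-way decision above can be expressed as a small function. This is a sketch of the rules as stated, assuming the image count and full-page coverage have already been determined for each page; the `PageInfo` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    """Hypothetical summary of one PDF page."""
    image_count: int        # number of images placed on the page
    full_page_image: bool   # True if one image covers the entire page


def classify_page(page):
    """Place a page into one of the three categories described above."""
    if page.image_count == 0:
        return "text-based"      # extract the native text only
    if page.image_count == 1 and page.full_page_image:
        return "image-based"     # send the whole page to OCR
    return "mixed-content"       # OCR each image, then merge with native text


# Made-up pages standing in for a three-page PDF.
pages = [
    PageInfo(image_count=0, full_page_image=False),
    PageInfo(image_count=1, full_page_image=True),
    PageInfo(image_count=3, full_page_image=False),
]
categories = [classify_page(p) for p in pages]
```

Classifying page by page, rather than once per file, is what lets a single PDF mix all three handling paths.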
Grooper’s “Run Speed” option gives you control to achieve an ideal balance between accuracy and performance.
Grooper recognizes 268 distinct languages which can be individually enabled or disabled. Language detection interprets dates, times, currency names, numeric formats, and more.
Grooper avoids optical character recognition altogether when dealing with original text-based files like Word, Excel, and text-based PDFs. Instead, Grooper pulls complete and perfect text directly out of the file.