Great OCR Starts with Image Quality

Focusing on image quality first ensures future success.

Before any OCR takes place, make sure you’re handing the engine an image that’s free of artifacts (non-text elements). How do you recognize text more accurately? First, remove everything from the page that isn’t text.

Grooper does this through industry-first image processing tools and out-of-the-box configurations specifically designed for this task. (To see these tools, check out our image processing.)

The best part is these tools won’t alter the original version of the image you want to permanently retain.

What Does Grooper Do Before Optical Character Recognition?

  • Removes lines
  • Ensures edges are clean
  • Removes small specks
  • Removes large non-text objects
  • Inverts white-on-black zones
  • Removes hole punches
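To make one of these steps concrete, here is a minimal sketch of speck removal on a tiny black-and-white image. It is not Grooper's implementation; the function name `remove_small_specks`, the `min_pixels` threshold, and the list-of-lists image format are all assumptions chosen for illustration. Note that it works on a copy, so the original image is left untouched.

```python
from collections import deque

def remove_small_specks(grid, min_pixels=3):
    """Return a cleaned copy of a binary image (1 = ink, 0 = background).

    Any 4-connected blob of ink with fewer than min_pixels pixels is erased.
    The input grid is never modified. (Hypothetical helper, not a Grooper API.)
    """
    rows, cols = len(grid), len(grid[0])
    cleaned = [row[:] for row in grid]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Flood-fill one connected blob of ink pixels.
            blob, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and grid[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Blobs below the size threshold are treated as specks.
            if len(blob) < min_pixels:
                for y, x in blob:
                    cleaned[y][x] = 0
    return cleaned

page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1],  # the lone pixel at the far right is a speck
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = remove_small_specks(page)
```

In this example the ring-shaped blob (standing in for a character) survives, while the isolated pixel is erased in the cleaned copy only.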

How to Get More Accurate OCR

No matter how clean and pristine your images may appear, many recognition engines still have a hard time collecting accurate text. Text inside images, text in multiple columns, and text in different font sizes all contribute to bad character recognition.

Another common problem that results in inaccurate capture is that nearly all engines process an entire page at once. Grooper solves this problem by focusing on select areas of a page and then synthesizing the results.

Grooper’s patented OCR synthesis engine intelligently performs multiple passes on different portions of a document image. It then groups the results together as a single unit, keeping only the most accurate text results.
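The general idea of combining several passes can be sketched in a few lines. This is only an illustration of "keep the best result per region", not Grooper's patented engine; the `RegionResult` type, the region labels, and the confidence scores are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RegionResult:
    region: str        # hypothetical region label, e.g. "header"
    text: str          # text returned by one OCR pass over that region
    confidence: float  # assumed score between 0.0 and 1.0

def synthesize(passes, region_order):
    """Keep the highest-confidence text per region, joined in reading order."""
    best = {}
    for result in passes:
        current = best.get(result.region)
        if current is None or result.confidence > current.confidence:
            best[result.region] = result
    return "\n".join(best[name].text for name in region_order if name in best)

passes = [
    RegionResult("header", "lnvoice 1O42", 0.61),        # full-page pass
    RegionResult("header", "Invoice 1042", 0.97),        # focused pass on the header
    RegionResult("body", "Total due: $310.00", 0.93),
    RegionResult("body", "Tota1 due: $31O.OO", 0.58),
]
merged = synthesize(passes, ["header", "body"])
```

Here the focused pass wins for the header and the first pass wins for the body, so the merged text takes the stronger reading of each region.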

In a lab test, Grooper accurately captured 99.91% of text. Using only OCR on the same data set proved half as accurate.

5 Techniques That Get More Accurate OCR

Intelligent Spell Correction During OCR

Powered by our Atomic RegEx engine, Grooper performs corrections to fix some pretty ugly stuff. What’s our secret? Tools such as K-Means Clustering, text removal, and text correction engines.

What Spelling Errors Does Grooper Correct?

  • Simple OCR mistakes in strings that don’t match words in a language of your choice.
  • Existing, human-generated typos on documents.
  • Re-inserting spaces where OCR falsely jammed multiple words together.
  • Deleting strings of non-alphanumeric characters that resemble somebody’s attempt at censorship, like “$#@! ^&*”.
  • Repairing numeric values where overly aggressive image cleanup has inadvertently removed commas and periods.
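Two of the corrections above can be illustrated with Python's standard library. This is a simplified sketch, not the Atomic RegEx engine: the tiny `LEXICON`, the `cutoff` value, and both function names are assumptions made for the example.

```python
import difflib

# Hypothetical mini-dictionary; a real system would load a full word list.
LEXICON = ["invoice", "total", "amount", "due", "payment", "date"]

def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Replace a word that is not in the lexicon with its closest match, if any.

    The match is returned in lowercase; unknown words with no close
    match are returned unchanged.
    """
    if word.lower() in lexicon:
        return word
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

def split_joined(text, lexicon=LEXICON):
    """Re-insert spaces into jammed-together words.

    Returns a list of lexicon words covering the whole string,
    or None when no complete split exists.
    """
    words = set(lexicon)
    best = {0: []}  # best[i] = one valid split of text[:i]
    for end in range(1, len(text) + 1):
        for start in range(end):
            if start in best and text[start:end] in words:
                best[end] = best[start] + [text[start:end]]
                break
    return best.get(len(text))

fixed = correct_word("lnvoice")          # OCR read "I" as "l"
parts = split_joined("totalamountdue")   # OCR dropped the spaces
```

A string that cannot be fully covered by lexicon words comes back as `None`, which leaves the caller free to keep the original text rather than force a bad correction.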

How to OCR a PDF Document

PDF has become the most widely used document standard in the world. But PDFs can be constructed in different ways which make capture difficult in some cases:

  • Some PDFs are purely text-based (easy to OCR)
  • Others are just images re-packaged into a PDF format (difficult)
  • Other PDFs have combinations of the two scattered throughout pages (most difficult)

As a result, PDFs present different challenges you’ll have to face in order to get the best text from every page. Thankfully, Grooper has tools to get around these obstacles…

How to Get Text off of PDFs:

Grooper looks at each page within a PDF to place the page into one of three categories: image-based, text-based, or mixed-content.

Each page is handled accordingly:

  • If a PDF page contains a single image which covers the entire page, it is considered an image-based page, and is processed using the best OCR software.
  • If a PDF page contains no images, it is considered a text-based page, and only the raw text behind the page is extracted.
  • For mixed-content pages, each image on the page is extracted to a temporary image. Each temporary image is processed. Then the results are intelligently merged with the native text.
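The three-way decision described above can be sketched as a small classifier. The `Page` and `Image` types and the `coverage` tolerance are assumptions for illustration; they do not reflect Grooper's internal data model, and a real PDF library would supply the page and image dimensions.

```python
from dataclasses import dataclass, field

@dataclass
class Image:
    width: float
    height: float

@dataclass
class Page:
    width: float
    height: float
    native_text: str = ""
    images: list = field(default_factory=list)

def classify_page(page, coverage=0.95):
    """Label a page as 'text-based', 'image-based', or 'mixed-content'.

    coverage is an assumed tolerance: a single image counts as covering
    the entire page when it spans at least this fraction of each dimension.
    """
    if not page.images:
        return "text-based"
    if len(page.images) == 1:
        img = page.images[0]
        if img.width >= coverage * page.width and img.height >= coverage * page.height:
            return "image-based"
    return "mixed-content"

scan = Page(612, 792, images=[Image(612, 792)])
report = Page(612, 792, native_text="Quarterly results")
brochure = Page(612, 792, native_text="Caption", images=[Image(200, 150), Image(300, 100)])
labels = [classify_page(p) for p in (scan, report, brochure)]
```

Each label then selects a handling path: OCR for the whole page, native text extraction only, or per-image OCR merged with the native text.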

Additional Tools

OCR Performance Balancing

Grooper’s “Run Speed” option gives you control to achieve an ideal balance between accuracy and performance.

Multi-Language Text Support

Grooper recognizes 268 distinct languages which can be individually enabled or disabled. Language detection interprets dates, times, currency names, numeric formats, and more.

Avoiding OCR on Electronic Text

Grooper avoids optical character recognition altogether when dealing with original text-based files like Word documents, Excel spreadsheets, and text-based PDFs. Instead, Grooper pulls complete and perfect text directly out of the file.
