Near-Perfect OCR Shouldn’t be so Difficult

The secret to high accuracy is removing all non-text elements.

Grooper does this through industry-first image processing tools and out-of-the-box configurations specifically designed for the task.

The best part is that these tools never alter the original image, which you retain permanently.

What’s the best OCR software? The best is actually a combination of engines. Use your own OCR engine, Grooper’s, or both.

How to Prep a Document Image for Optical Character Recognition

  • Remove lines
  • Clean up document edges
  • Remove small specks
  • Remove large non-text objects
  • Invert white-on-black zones
  • Remove hole punches
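Two of the steps above, line removal and speck removal, can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not Grooper's implementation: the page is a small binary grid (1 = black pixel, 0 = white), and the function names and thresholds are hypothetical. Both functions work on a copy, so the original image stays untouched.

```python
from collections import deque


def remove_long_horizontal_lines(img, min_run):
    """Return a copy of img with horizontal black runs of at least min_run pixels cleared."""
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x] == 1:
                start = x
                while x < len(row) and row[x] == 1:
                    x += 1
                # A long unbroken run is treated as a ruled line, not text.
                if x - start >= min_run:
                    for i in range(start, x):
                        out[y][i] = 0
            else:
                x += 1
    return out


def remove_specks(img, max_area):
    """Return a copy of img with 4-connected black blobs of at most max_area pixels cleared."""
    h = len(img)
    w = len(img[0]) if img else 0
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first search collects one connected component.
            component = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and img[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) <= max_area:
                for cy, cx in component:
                    out[cy][cx] = 0
    return out
```

Running line removal before speck removal matters in this sketch: a ruled line that touches a speck would otherwise merge with it into one large component and protect it from cleanup.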

Tips for More Accurate OCR

No matter how clean and pristine document images appear, text recognition software still struggles to collect accurate text. Text embedded in images, multi-column layouts, and mixed font sizes all contribute to bad character recognition.

Another cause of inaccurate capture is that OCR engines process whole pages, from top to bottom. A better approach is to OCR select areas of a page and combine the results.

Grooper’s patented OCR synthesis engine intelligently performs multiple passes on different portions of a document image. Results are grouped together as a single unit, providing highly accurate text results.

In a lab test, Grooper accurately captured 99.91% of text. Using OCR alone on the same data set proved half as accurate.
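The region-by-region idea described above can be sketched as follows. This is a minimal illustration, not Grooper's patented synthesis engine: `Region`, `crop`, and `ocr_by_region` are hypothetical names, the page is modeled as a list of rows, and `ocr_fn` stands in for whatever OCR engine you call on each cropped area.

```python
from dataclasses import dataclass


@dataclass
class Region:
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def crop(img, region):
    """Cut one rectangular region out of a page stored as a list of rows."""
    return [row[region.left:region.right] for row in img[region.top:region.bottom]]


def ocr_by_region(img, regions, ocr_fn):
    """OCR each region separately, then join the results in reading order."""
    # Reading order here is simply top-to-bottom, then left-to-right.
    ordered = sorted(regions, key=lambda r: (r.top, r.left))
    parts = [ocr_fn(crop(img, r)).strip() for r in ordered]
    return "\n".join(p for p in parts if p)
```

With a two-column page, passing one region per column keeps each column's text together instead of interleaving the columns line by line, which is what a single top-to-bottom pass tends to do.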

5 Features to Guarantee Accurate OCR

Intelligent Spell Correction

Powered by Atomic RegEx, Grooper performs corrections to fix some pretty ugly stuff. And the secret to making this work? K-Means Clustering, text removal, and text correction engines.

What Spelling Errors Does Grooper Correct?

  • Simple OCR mistakes in strings that don’t match words in a standard dictionary
  • Human-generated typos on documents
  • Words that OCR falsely jammed together (Grooper inserts the missing spaces)
  • Strings of non-alphanumeric characters that resemble an attempt at censorship, like “$#@! ^&*” (Grooper deletes them)
  • Numeric values that lost punctuation to overly aggressive image cleanup (Grooper repairs them)
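Two of these corrections, fixing near-miss words and splitting jammed words, can be approximated with the Python standard library. This sketch does not reproduce Atomic RegEx or Grooper's clustering approach; it swaps in fuzzy matching from `difflib` and a small dynamic-programming word splitter. The function names, the tiny vocabulary, and the 0.75 cutoff are assumptions for illustration.

```python
import difflib


def correct_word(word, vocabulary, cutoff=0.75):
    """Replace word with its closest vocabulary entry, if one is similar enough."""
    lowered = word.lower()
    if lowered in vocabulary:
        return word
    # Sorting makes the candidate order deterministic when vocabulary is a set.
    matches = difflib.get_close_matches(lowered, sorted(vocabulary), n=1, cutoff=cutoff)
    return matches[0] if matches else word


def split_jammed(text, vocabulary):
    """Split text into the fewest vocabulary words; return None if no full split exists."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] holds the best split of text[:i]
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and text[j:i].lower() in vocabulary:
                candidate = best[j] + [text[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[n]
```

For example, with the vocabulary {"invoice", "total", "amount", "due"}, `correct_word("invo1ce", ...)` maps the OCR error back to "invoice", and `split_jammed("totalamountdue", ...)` recovers the three separate words. A production system would need a far larger dictionary and context-aware scoring.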

How to OCR a PDF Document

PDF is the most widely used document standard in the world. Because PDFs can be generated in many different ways, capturing their text varies in difficulty:

  • Some PDFs are purely text-based (easy to capture)
  • Others are just document scans in PDF format (difficult)
  • Other PDFs have combinations of the two scattered throughout pages (most difficult)

PDF documents present a fair number of text recognition challenges.

How to Get Text off of PDFs:

Grooper looks at each page within a PDF and places the page into one of three categories: image-based, text-based, or mixed-content. Because this classification is automatic, Grooper can apply the rules and processing techniques specific to each category, which makes text extraction easier.

Each page is handled accordingly:

  • Process PDF pages containing a single image covering the entire page as image-based pages
  • If a page contains no images, extract only the raw text behind the page
  • For mixed-content pages, extract each embedded image to a temporary image, process it, and merge the results with the native text
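The three rules above can be sketched as a small classifier and dispatcher. This is an illustration under stated assumptions rather than Grooper's actual logic: `PageInfo`, `PageImage`, `classify_page`, and `extract_text` are hypothetical names, the 95% coverage threshold for "covers the entire page" is a guess, and `ocr_fn` stands in for whatever image processing and OCR step you use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PageImage:
    width: float
    height: float


@dataclass
class PageInfo:
    width: float
    height: float
    native_text: str = ""
    images: List[PageImage] = field(default_factory=list)


def classify_page(page, coverage=0.95):
    """Label a page as text-based, image-based, or mixed-content."""
    if not page.images:
        return "text-based"
    if len(page.images) == 1:
        img = page.images[0]
        # One image spanning (nearly) the whole page is treated as a scan.
        if img.width >= coverage * page.width and img.height >= coverage * page.height:
            return "image-based"
    return "mixed-content"


def extract_text(page, ocr_fn):
    """Apply the extraction rule that matches the page's category."""
    kind = classify_page(page)
    if kind == "text-based":
        return page.native_text
    if kind == "image-based":
        return ocr_fn(page.images[0])
    # Mixed content: OCR every embedded image and merge with the native text.
    ocr_parts = [ocr_fn(img) for img in page.images]
    return "\n".join([page.native_text] + ocr_parts).strip()
```

Classifying per page rather than per file is the key design choice here: a single PDF can mix scanned and native pages, and each one gets only the processing it needs.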

Additional Tools

OCR Performance Balancing

Grooper’s “Run Speed” option provides control to achieve the ideal balance between accuracy and performance.

Multi-Language Text Support

Grooper recognizes 268 distinct languages and 523 regional cultures. Language detection interprets dates, times, currency names, numeric formats, and more.

Avoiding OCR on Electronic Text

Grooper avoids optical character recognition altogether when dealing with original text-based files like Word, Excel, and Text PDFs. Instead, Grooper pulls complete and perfect text directly from the file.
