It All Starts with Image Quality

Before any OCR action takes place, you’ll want to make sure you’re handing the OCR activity an image that is straight and free of artifacts. The key is to remove everything from the page that isn’t text.

Grooper lets you process images through a growing arsenal of exclusive tools and out-of-the-box profiles specifically designed for this task. The best part is these tools won’t alter the original version of the image you want to permanently retain.

Examples

  • Remove lines
  • Ensure edges are clean
  • Remove small specks
  • Remove large non-text objects
  • Invert white-on-black zones
  • Remove hole punches
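
To make one of these cleanup steps concrete, here is a minimal Python sketch of speck removal on a tiny black-and-white image. It is an illustration only, not Grooper's implementation: the function name `remove_small_specks`, the `min_pixels` threshold, and the list-of-lists image format are all assumptions made for the example. It works on a copy, so the original image stays untouched.

```python
from collections import deque


def remove_small_specks(image, min_pixels=3):
    """Return a cleaned copy of a binary image (1 = ink, 0 = background).

    Any 4-connected blob of ink with fewer than min_pixels pixels is erased.
    Hypothetical helper for illustration; the input image is never modified.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill to collect every pixel of this blob.
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Blobs below the threshold are treated as specks and erased.
            if len(blob) < min_pixels:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


page = [
    [1, 0, 0, 0, 0],  # isolated speck in the top-left corner
    [0, 0, 1, 1, 1],  # part of a larger stroke
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],  # isolated speck in the bottom-right corner
]
cleaned = remove_small_specks(page, min_pixels=3)
```

The two single-pixel specks are erased, while the four-pixel stroke survives because it meets the threshold.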

Multi-Pass OCR

No matter how clean and pristine your images may appear, outdated OCR engines still have a difficult time collecting accurate text from images with multiple columns, different font sizes, and image shear. Grooper’s patented OCR synthesis engine intelligently performs multiple passes of OCR on different portions of the image and Groops the results together as a single unit, keeping only the most accurate text results.
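
The idea of keeping only the best result from several passes can be sketched in a few lines of Python. This is a simplified stand-in, not the patented synthesis engine: it assumes each pass reports a confidence score per page region, and the names `RegionResult` and `merge_passes`, along with the sample data, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionResult:
    region: str        # hypothetical region label, e.g. "left-column"
    text: str          # text recognized for that region
    confidence: float  # assumed score between 0.0 and 1.0


def merge_passes(passes):
    """Keep, for every region, the result with the highest confidence."""
    best = {}
    for results in passes:
        for result in results:
            current = best.get(result.region)
            if current is None or result.confidence > current.confidence:
                best[result.region] = result
    return best


# Pass 1 reads the whole page at once; pass 2 re-reads only the left column.
full_page_pass = [
    RegionResult("left-column", "lnvoice total", 0.71),
    RegionResult("right-column", "Due on receipt", 0.93),
]
column_pass = [RegionResult("left-column", "Invoice total", 0.96)]

merged = merge_passes([full_page_pass, column_pass])
```

Here the second pass wins for the left column because its confidence is higher, while the right column keeps the only result it has.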

And When All Else Fails, We’ve Got Spell Correction

Powered by our Atomic RegEx engine, Grooper can perform OCR correction to fix some pretty ugly stuff.

Examples

  • Correct simple OCR mistakes in strings that don’t match words in a language of your choice.
  • Fix existing, human-generated typos on documents.
  • Re-insert spaces where OCR falsely jammed multiple words together.
  • Delete strings of non-alphanumeric characters that resemble somebody’s attempt at censorship, like “$#@! ^&*”.
  • Repair numeric values where overly aggressive image cleanup has inadvertently removed commas and periods.
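
Two of the corrections above can be approximated with Python's standard `re` module and a small word list. This is a rough sketch that uses Python as a stand-in for the Atomic RegEx engine; the function names, the three-character threshold, and the sample vocabulary are assumptions made for illustration.

```python
import re


def strip_symbol_runs(text):
    """Delete runs of three or more symbols, then tidy the leftover spaces."""
    without_runs = re.sub(r"[^\w\s]{3,}", " ", text)
    return re.sub(r"\s+", " ", without_runs).strip()


def split_jammed_words(token, vocabulary):
    """Re-insert spaces when a token is made entirely of known words.

    best[i] holds a list of words that exactly covers token[:i], or None
    if no such split exists. Unknown tokens are returned unchanged.
    """
    best = [None] * (len(token) + 1)
    best[0] = []
    for end in range(1, len(token) + 1):
        for start in range(end):
            piece = token[start:end]
            if best[start] is not None and piece.lower() in vocabulary:
                best[end] = best[start] + [piece]
                break
    return " ".join(best[-1]) if best[-1] else token


vocabulary = {"purchase", "order", "number"}  # hypothetical word list

cleaned = strip_symbol_runs("What the $#@! ^&* happened")
repaired = split_jammed_words("purchaseordernumber", vocabulary)
```

Note that the symbol pattern would also remove a legitimate ellipsis, so a production rule would need to be more selective.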

Performance Balancing

Grooper’s “Run Speed” option gives you control to achieve an ideal balance between accuracy and performance.

Language Support

Grooper recognizes 268 distinct languages, each of which can be individually enabled or disabled. Language detection interprets dates, times, currency names, numeric formats, and more.

Electronic Text

Grooper avoids OCR altogether when dealing with original text-based files like Word, Excel, and Text PDFs. Instead, Grooper pulls complete and perfect text directly out of the file.

Intelligent PDF Text Extraction

PDF has become the most widely used document standard in the world. With that adoption comes a variety of challenges you’ll have to face in order to get the best text from every page. Some PDFs are purely text-based, others are just images repackaged in PDF format, and still others combine the two across their pages.

A Hybrid Approach

  • Grooper examines each page within a PDF to place the page into one of three categories: image-based, text-based, or mixed-content. Then each page is handled accordingly.
  • If a PDF page contains a single image which covers the entire page, it is considered an image-based page, and is processed using OCR.
  • If a PDF page contains no images, only the raw text behind the page is extracted.
  • For mixed-content pages, each image on the page is extracted to a temporary image. Each temporary image is processed through OCR. Then the OCR results are merged with the native text.
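
The three-way decision described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Grooper's code: the `PdfPage` and `PdfImage` types, the `covers_full_page` flag, and the caller-supplied `ocr` function are hypothetical, and the merge step is reduced to simple concatenation because the page does not describe how results are actually merged.

```python
from dataclasses import dataclass, field


@dataclass
class PdfImage:
    label: str                      # hypothetical identifier for the image
    covers_full_page: bool = False  # assumed to be known from page geometry


@dataclass
class PdfPage:
    native_text: str = ""
    images: list = field(default_factory=list)


def classify_page(page):
    """Place a page into one of the three categories."""
    if not page.images:
        return "text-based"
    if len(page.images) == 1 and page.images[0].covers_full_page:
        return "image-based"
    return "mixed-content"


def extract_text(page, ocr):
    """Handle a page according to its category; ocr is any image-to-text callable."""
    kind = classify_page(page)
    if kind == "text-based":
        return page.native_text
    if kind == "image-based":
        return ocr(page.images[0])
    # Mixed content: OCR each image, then combine with the native text.
    ocr_parts = [ocr(image) for image in page.images]
    return "\n".join([page.native_text, *ocr_parts])


def fake_ocr(image):
    """Placeholder OCR used only to make the example runnable."""
    return f"[ocr:{image.label}]"


scanned = PdfPage(images=[PdfImage("scan", covers_full_page=True)])
digital = PdfPage(native_text="Quarterly report")
mixed = PdfPage(native_text="Signed by:", images=[PdfImage("signature")])
```

Calling `extract_text(mixed, fake_ocr)` returns the native text followed by the OCR output for the embedded image, while the scanned and digital pages each follow a single path.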
