Grooper enhances images displayed to users throughout your workflow, removes image artifacts known to interfere with OCR, provides crisp versions for permanent archival, and analyzes page structure to assist with automated decision making downstream.
Image Processing – the secret to achieving near-perfect OCR
Good OCR (optical character recognition) starts with images free of non-text artifacts. Grooper has dozens of features that remove everything that isn’t text to ensure you get distraction-free OCR. Let’s look at a few examples.
Safe and Clean Halftone Removal
Dithering and other halftone patterns are a direct result of legacy document imaging platforms poorly converting color images to black and white. These artifacts must be eliminated to prevent massive errors in OCR results, particularly with punctuation like periods and commas.
Halftone artifacts completely surround the text we’d like to capture, so OCR stands very little chance of reading these characters.
Grooper recognizes dithered patterns and safely removes them without eliminating legitimate punctuation that is close to letters on the page.
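To make the idea concrete, here is a minimal sketch of one way to tell dither specks apart from punctuation. It is not Grooper's actual algorithm: the function names and the `max_speck`, `radius`, and `min_neighbors` parameters are assumptions made for illustration. The heuristic treats a tiny blob as dither only when several other tiny blobs sit close by, so a lone period next to a letter is left alone. The image is a plain list of rows, with 1 for ink and 0 for background.

```python
from collections import deque


def connected_components(img):
    """Return the 8-connected ink blobs as lists of (row, col) pixels."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, comp = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                comps.append(comp)
    return comps


def remove_dither(img, max_speck=2, radius=3, min_neighbors=2):
    """Erase tiny blobs that sit in a cluster of other tiny blobs.

    Hypothetical heuristic: isolated specks (likely punctuation) are kept,
    clustered specks (likely a dither pattern) are removed.
    """
    specks = [c for c in connected_components(img) if len(c) <= max_speck]
    centers = [(sum(y for y, _ in c) / len(c), sum(x for _, x in c) / len(c))
               for c in specks]
    out = [row[:] for row in img]  # work on a copy, leave the input untouched
    for i, comp in enumerate(specks):
        cy, cx = centers[i]
        neighbors = sum(
            1 for j, (oy, ox) in enumerate(centers)
            if j != i and abs(oy - cy) <= radius and abs(ox - cx) <= radius
        )
        if neighbors >= min_neighbors:
            for y, x in comp:
                out[y][x] = 0
    return out
```

A density test like this is deliberately simple. It would struggle when the dither pattern runs right up against punctuation, which is the hard case a production tool has to handle.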
Seriously Brilliant Border Removal
Borders have historically been very tricky to remove when the black region doesn’t extend all the way to the edge of the page.
Grooper handles a variety of uncommon border scenarios and removes the borders cleanly.
Photoshop-Like Inpaint Removal
You work with full-color documents every day. Why shouldn’t your image processing do the same?
Our object removal strategies break out of the black and white color space and allow you to remove artifacts from color images as if they were never there.
Pixel-Perfect Line Detection & Removal
Lines appear throughout standardized forms, table structures, and pages with “fill-in-the-blank” comb boxes, where they provide visual cues that make the page easier to read.
OCR engines commonly mistake these lines, particularly the short, vertical ones, for characters. Grooper can erase them with ease.
- Dropout is performed using a very precise, pixel-by-pixel mask of lines rather than a generic point-to-point/thickness approach. This technique leaves no edge artifacts behind, providing a cleaner OCR image.
- Works well with very short lines, even those smaller than many of the letters on your page. Grooper knows the difference between lines that should be removed and characters like “l”, “I”, and “1”.
- Characters connected to lines are detected and preserved, keeping valuable data for your OCR process.
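
The mask-based dropout described above can be sketched in a few lines. This is an illustrative approximation, not Grooper's implementation: the function names and the `min_len` threshold are assumptions. It marks every pixel that belongs to a long horizontal run of ink, then erases only those marked pixels, skipping any spot where a character stroke passes through the line. It does not attempt the harder job of telling very short lines apart from characters like “l”.

```python
def horizontal_line_mask(img, min_len):
    """Mark each ink pixel that lies in a horizontal run of >= min_len pixels."""
    h, w = len(img), len(img[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        x = 0
        while x < w:
            if img[y][x]:
                start = x
                while x < w and img[y][x]:
                    x += 1
                if x - start >= min_len:
                    for i in range(start, x):
                        mask[y][i] = True
            else:
                x += 1
    return mask


def crosses_line(img, mask, y, x):
    """True if unmasked ink sits directly above and below the line at column x."""
    h = len(img)
    up = y - 1
    while up >= 0 and mask[up][x]:      # skip over the rest of a thick line
        up -= 1
    down = y + 1
    while down < h and mask[down][x]:
        down += 1
    return up >= 0 and down < h and bool(img[up][x]) and bool(img[down][x])


def remove_horizontal_lines(img, min_len=8):
    """Erase masked line pixels, keeping spots where a stroke crosses the line."""
    mask = horizontal_line_mask(img, min_len)
    out = [row[:] for row in img]  # work on a copy, leave the input untouched
    for y, row in enumerate(mask):
        for x, is_line in enumerate(row):
            if is_line and not crosses_line(img, mask, y, x):
                out[y][x] = 0
    return out
```

Because only masked pixels are touched, a character that merely rests on the line keeps all of its own pixels. Vertical lines could be handled the same way by transposing the image first, for example with `[list(col) for col in zip(*img)]`.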