Grooper Consultant Training: Unstructured Data Extraction Training

November 17 - November 19


Unstructured Data Extraction Training

Course Overview

Unstructured documents present a unique challenge for most document processing platforms. These documents use natural language to convey information rather than placing data in tables or fixed geometric layouts. Contracts, for example, scatter clauses, party names, dates, and other information throughout the text, and each item must be located by understanding the language around it. While this may be easy for a human to do (depending on the complexity of the contract!), it is more difficult for a machine, even one using a sophisticated algorithm, to understand the relationships between words in a paragraph.

Course Goals

This course teaches users how to apply Grooper’s natural language processing capabilities to extract data from unstructured documents. Grooper’s approach to natural language processing is two-fold:

  • User-assisted machine learning: Determining the semantic importance of the text features surrounding a piece of data by weighting them with the TF/IDF algorithm.
  • Text structuring: Combining paragraph detection and flow-based collation with data extraction to simulate how humans break up text when reading a document.
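To make the TF/IDF weighting idea concrete, the sketch below computes textbook TF/IDF weights for the words surrounding a candidate value. It is only an illustration of the general algorithm, not Grooper’s implementation: the function name `tf_idf`, the sample context strings, and the whitespace tokenization are all assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Textbook formulation (an assumption, not Grooper's internals):
    weight = (term count / document length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical context windows: the words found around a candidate date value.
contexts = [
    "this agreement is effective as of the date".split(),
    "the invoice date is shown above".split(),
    "payment is due by the date shown".split(),
]
weights = tf_idf(contexts)

# Words that appear in every context (such as "date") receive a weight of zero,
# while words unique to one context (such as "agreement") are weighted highest.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In this toy example, a word such as "agreement" or "effective" would be a stronger signal that a nearby date is a contract's effective date than a word such as "the", which appears in every context and so carries no weight.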

Key Concepts

  • Natural language processing using the Field Class data extractor
  • User-assisted machine learning with the TF/IDF algorithm
  • How to train your data: When to stop training
  • Grooper’s paragraph detection

Adjacent Knowledge

  • Dealing with bad OCR: FuzzyRegEx
  • Dealing with too many names, parties, and other unstructured information: Lexicon training

Practical Application

  • Students will build data models that impose structure on unstructured data
  • Students will demonstrate an understanding of the TF/IDF algorithm by training on document sets in order to extract data via the Field Class extractor
  • Students will workshop simple to complex document sets across multiple industries to successfully target and extract their data elements


Event Category: Remote online instructor-led training

