Grooper Consultant Training: Unstructured Data Extraction Training
November 17 - November 19
$5,000
Unstructured Data Extraction Training
Unstructured documents present a unique challenge for most document processing platforms. These documents use natural language to convey information rather than placing data in table-like or geometric structures. Contracts, for example, contain clauses, party names, dates, and other information scattered throughout the text, each of which must be located by understanding the language around it. While this may be easy for a human to do (depending on the complexity of the contract!), it is more difficult for a machine – even using a sophisticated algorithm – to understand the relationships between words in a paragraph.
This course teaches users how to apply Grooper’s natural language processing capabilities to extract data from unstructured documents. Grooper’s approach to natural language processing is two-fold:
1) User-assisted machine learning: Gauging the semantic importance of the text features surrounding a data value by weighting them with the TF/IDF algorithm.
2) Text Structuring: Applying paragraph detection and flow-based collation to data extraction in order to simulate how humans break up text when reading a document.
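For readers unfamiliar with TF/IDF, the weighting idea can be sketched in a few lines of Python. This is a generic textbook formulation (term frequency multiplied by the log of inverse document frequency), not Grooper’s implementation; the function name `tf_idf`, the whitespace tokenizer, and the sample sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (illustrative only)."""
    # Naive tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: how many documents contain each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Term frequency times log inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample text, not taken from any real training set.
docs = [
    "this agreement is effective as of the date signed",
    "the lessee shall pay rent to the lessor",
    "the effective date of the lease is stated below",
]
weights = tf_idf(docs)
```

In this sketch a word that appears in every sample, such as "the", receives a weight of zero, while a word confined to one sample, such as "lessee", receives a positive weight. That is the sense in which the algorithm separates distinctive context from common filler.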
- Natural language processing using the Field Class data extractor
- User-assisted machine learning with the TF/IDF algorithm
- How to train your data: When to stop training
- Grooper’s paragraph detection
- Dealing with bad OCR: FuzzyRegEx
- Dealing with too many names, parties, and other unstructured information: Lexicon training
- Students will build data models that impose structure on unstructured data
- Students will demonstrate an understanding of the TF/IDF algorithm by training document sets in order to extract data via the Field Class extractor
- Students will workshop simple to complex document sets across multiple industries to successfully target and extract their data elements