Document understanding

The key technology in our products is document understanding, also known as information extraction or document parsing. This is the capability to make sense of the free-form language in unstructured documents. We combine advanced statistical and rule-based natural language processing techniques to get the best results. This allows our systems to recognize patterns, concepts, and their relationships in text in order to fill the fields of a structured representation.
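To make "filling the fields of a structured representation" concrete, here is a minimal Python sketch. It is only an illustration, not our actual implementation: the `Candidate` record, its field names, the toy skill list, and the regular expressions are all assumptions chosen for the example, and real documents need far more robust handling than a few hand-written patterns.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical structured record; the field names are illustrative."""
    email: Optional[str] = None
    years_experience: Optional[int] = None
    skills: List[str] = field(default_factory=list)


# Toy vocabulary, assumed for this example only.
KNOWN_SKILLS = ["Java", "Python", "SQL"]


def extract(text: str) -> Candidate:
    """Fill the record's fields from free text using simple patterns."""
    record = Candidate()

    # Very rough email pattern, good enough for the demo sentence.
    match = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
    if match:
        record.email = match.group(0)

    # Phrases such as "15 years of experience".
    match = re.search(r"(\d+)\s+years?\s+of\s+experience", text, re.IGNORECASE)
    if match:
        record.years_experience = int(match.group(1))

    # Keep every known skill that appears as a whole word.
    record.skills = [
        skill for skill in KNOWN_SKILLS
        if re.search(rf"\b{re.escape(skill)}\b", text)
    ]
    return record


text = (
    "Jane Doe (jane.doe@example.com) has 15 years of experience "
    "as a programmer, mainly in Java and SQL."
)
record = extract(text)
print(record)
```

Once the text is in this form, an application can search or filter on `years_experience` or `skills` directly instead of matching keywords in the raw document.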

The details

This structured representation enables us to build applications around the semantics (i.e. the meaning) of the document, rather than just around the keywords present in the text. Structured information gives better insights, faster action, and more relevant matches. Full-text documents therefore become much more valuable for your organisation, without the need for laborious data entry.

Our software does not rely on fixed formats to extract key information from documents.

Why is this difficult? The difficulty is inherent in the endless expressivity of human language. The same meaning can be expressed with a variety of words and many different sentence structures. Words are ambiguous too: the same word can mean many different things depending on context. "Java" could be a location, part of a company name, or a computer skill. Spotting the concepts in a document is one thing; determining whether the right relation exists between them is another. Was someone the CFO of a company, an assistant to the CFO, or reporting to the CFO? Do you have 15 years of experience as a programmer, or 6 months? These are huge differences. Even the layout and formatting chosen by individual writers introduce a lot of variation.
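The "Java" ambiguity can be illustrated with a small Python sketch that guesses the meaning from the surrounding words. This is a deliberately naive illustration, not how our software works: the cue-word lists and the `classify_java` function are invented for the example, and any real text will quickly run into cases they do not cover.

```python
import re

# Hypothetical cue words per meaning; assumed for illustration only.
CUES = {
    "location": {"island", "indonesia", "jakarta", "visited", "travelled"},
    "company": {"ltd", "inc", "bv", "coffee", "employer"},
    "skill": {"programming", "programmer", "developer", "spring", "code"},
}


def classify_java(sentence: str) -> str:
    """Pick the meaning of "Java" whose cue words overlap most with the sentence."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {label: len(tokens & cues) for label, cues in CUES.items()}
    best = max(scores, key=scores.get)
    # Without any cue word there is no evidence for a meaning.
    return best if scores[best] > 0 else "unknown"


for sentence in [
    "Worked as a Java developer for five years",
    "Visited Java, Indonesia in 2010",
    "Sales manager at Java Coffee Ltd",
]:
    print(sentence, "->", classify_java(sentence))
```

Hand-maintained cue lists like these break down as soon as the wording changes, which is one reason the relation questions above (CFO or assistant to the CFO?) are hard to capture with fixed rules.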

These complexities of natural language make it very difficult to build systems that process unstructured information the way most software is built, namely by programming sets of rules. One approach has proven the most successful: machine learning. It allows software to learn language structure, variation, and meaning from large sets of data.
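As a rough sketch of what learning from data means, the Python example below trains a tiny Naive Bayes classifier on four labelled sentences and uses it to decide whether "Java" refers to a skill or a location. Naive Bayes is used here only because it is one of the simplest statistical learners; this text does not say which models our products use. The `TinyNaiveBayes` class and the training sentences are assumptions made for the example, and real systems learn from very large data sets.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, examples):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter()            # label -> number of examples
        for text, label in examples:
            self.label_counts[label] += 1
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the label
            for word in text.lower().split():
                if word in self.vocab:   # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, invented for this example.
training = [
    ("java developer with spring experience", "skill"),
    ("programmer writing java code", "skill"),
    ("holiday on the island of java", "location"),
    ("flew to java in indonesia", "location"),
]
model = TinyNaiveBayes().fit(training)

print(model.predict("senior java programmer"))
print(model.predict("trip to java island"))
```

No rule about "Java" was written by hand here: the association between context words and meanings comes entirely from the labelled examples, so covering new variation is a matter of adding data rather than rewriting rules.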