Textkernel combines many state-of-the-art machine learning techniques to build accurate document understanding models. Machine learning is so successful because large sets of example data contain enough patterns that recur frequently in new data: despite their variation, documents tend to be quite similar. The system therefore learns to generalize in a way that maximises its chances of success on any type of new document.
Machine learning as an art
Our approach is to learn to make many small decisions about words in their context. These small classification decisions can take into account many types of features: words, phrases, context, layout, knowledge of the domain and consistency across the whole document. If similar patterns of features have been seen in the training data, the software will be able to make a sensible (similar) interpretation of the concepts in the document. Since this process tries to mirror human understanding, we refer to it as Document Understanding, to differentiate it from character recognition or fixed-format data extraction.
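To make the idea of features for a word in its context concrete, here is a minimal Python sketch. It is an illustration only, not Textkernel's actual feature set: the function name `token_features`, the feature names and the example tokens are all hypothetical.

```python
def token_features(tokens, i):
    """Return a dict of illustrative (hypothetical) features for tokens[i]."""
    word = tokens[i]
    return {
        # Features of the word itself
        "lower": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        # Features of the surrounding context
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
        # A crude stand-in for layout: position of the word in the line
        "position": i,
    }


# Hypothetical line from a CV, already split into words
tokens = ["Jane", "Doe", "Software", "Engineer", "since", "2015"]
features = [token_features(tokens, i) for i in range(len(tokens))]
```

Each dictionary could then be passed to a classifier that makes one small decision per word. A real system would add the richer features the text mentions, such as phrases, domain knowledge and document-wide consistency.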
The multitude of small decisions still gives many possibilities for the meaning of the whole document. The powerful algorithms in our software combine many of the small patterns into a coherent interpretation of the document by reasoning with probabilities.
Probabilistic models are a very powerful approach, because they give our engineers the possibility to model many types of knowledge that interact to form the best decisions.
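One common way to combine many local decisions into a single coherent interpretation is Viterbi decoding over a sequence model. The sketch below assumes that technique purely for illustration; the text does not say which algorithms Textkernel uses, and the labels, probabilities and function name are all made up. It assumes every probability is non-zero.

```python
import math


def viterbi(local_probs, transitions, start):
    """Return the most probable label sequence.

    local_probs: list of dicts, label -> probability from the per-word decision
    transitions: dict, (previous_label, label) -> probability
    start:       dict, label -> probability of starting with that label
    All probabilities are assumed to be non-zero.
    """
    labels = list(start)
    # best[t][label] = (log-score of the best path ending in label, previous label)
    best = [{
        lab: (math.log(start[lab]) + math.log(local_probs[0][lab]), None)
        for lab in labels
    }]
    for t in range(1, len(local_probs)):
        row = {}
        for lab in labels:
            prev, score = max(
                ((p, best[t - 1][p][0] + math.log(transitions[(p, lab)]))
                 for p in labels),
                key=lambda pair: pair[1],
            )
            row[lab] = (score + math.log(local_probs[t][lab]), prev)
        best.append(row)
    # Backtrack from the best final label
    last = max(labels, key=lambda lab: best[-1][lab][0])
    path = [last]
    for t in range(len(local_probs) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return path[::-1]


# Hypothetical per-word probabilities for "Jane Doe Engineer"
local_probs = [
    {"NAME": 0.9, "OTHER": 0.1},
    {"NAME": 0.45, "OTHER": 0.55},  # ambiguous on its own
    {"NAME": 0.1, "OTHER": 0.9},
]
# Hypothetical knowledge: a name word is usually followed by another name word
transitions = {
    ("NAME", "NAME"): 0.8, ("NAME", "OTHER"): 0.2,
    ("OTHER", "NAME"): 0.4, ("OTHER", "OTHER"): 0.6,
}
start = {"NAME": 0.5, "OTHER": 0.5}

independent = [max(p, key=p.get) for p in local_probs]
combined = viterbi(local_probs, transitions, start)
print(independent)  # ['NAME', 'OTHER', 'OTHER']
print(combined)     # ['NAME', 'NAME', 'OTHER']
```

Taken on its own, the second word is labelled OTHER. Once the transition knowledge is included, the whole-sequence interpretation labels it NAME, which shows in miniature how reasoning with probabilities lets different kinds of knowledge interact.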
Is it possible to obtain perfect understanding? No. Any automatic document understanding system will still make a fair number of mistakes compared to humans. Nonetheless, the machine learning approach currently offers the most efficient and accurate systems at the lowest cost.
Machine learning is not magic, but it is an art: it requires a lot of expertise in applying algorithms in the best way and in engineering the right types of features and knowledge representations. The field of machine learning for language processing is also still advancing and evolving rapidly. Textkernel has a large international group of top experts and keeps in close contact with the academic research community in this area.
This approach allows Textkernel to build cost-effective document understanding systems for global customers, which matters because each language requires its own knowledge acquisition for a domain. With our roots firmly in Europe and in machine learning, we are able to deal with multilinguality, which is the key issue in a global recruitment technology market.