How do you represent the word “Amsterdam” in a computer? How do you capture its semantics (Amsterdam is both a city and a capital)? And how do you make sure that London has a similar representation since it is also a city and a capital? Deep Learning is a novel technique that attempts to answer these questions.
With Deep Learning, large amounts of text data are processed through algorithms to automatically learn representations of similar words. Textkernel has started expanding its ‘document understanding’ models (cv and vacancy parsing) to take advantage of the benefits of Deep Learning.
Using raw data to learn new knowledge
In the case of text, Deep Learning exploits the fact that similar words occur in similar contexts to infer the meaning of a word. For instance in a CV extraction system, the words “Amsterdam” and “London” tend to be used in addresses as the “city”. Deep Learning sifts through large amounts of data and produce word representations that cluster together these similar words. When a new word with representation similar to Amsterdam and London is found, it is likely to be a city. In this way, new knowledge can be inferred from raw data.
A representation of the name (red) and address (black) words from 4 CVs. The plot is a projection in 3D of the word representation inferred using Deep Learning. Note how first names and parts of British postcodes (e.g. 1XA) each tend to cluster together.
Increasing coverage and robustness
Deep Learning has allowed Textkernel to break free from the limitations of using human annotated data in its ‘machine learning’ pipeline. Adding new knowledge used to be a time consuming process. For example, a list of skills had to be manually gathered and then integrated in the pipeline. With Deep Learning this process can be automated and implemented in a more systematic fashion. This new knowledge increases the robustness of Textkernel’s document understanding models, makes them more responsive to new words and increases their domain coverage.
Just like Google and Facebook, Textkernel is learning how to best leverage this new technique. The first results already show significant improvements although the Deep Learning R&D efforts are at an initial phase.
Expect even better parsing in the future!