How do you represent the word “Amsterdam” in a computer? How do you capture its semantics (Amsterdam is both a city and a capital)? And how do you make sure that London has a similar representation since it is also a city and a capital? Deep Learning is a novel Artificial Intelligence technique that attempts to answer these questions.
With Deep Learning, large amounts of text data are processed by algorithms that automatically learn similar representations for similar words. Textkernel has started expanding its ‘document understanding’ models (CV and vacancy parsing) to take advantage of the benefits of Deep Learning.
Using raw data to learn new knowledge
In the case of text, Deep Learning exploits the fact that similar words occur in similar contexts to infer the meaning of a word. For instance, in a CV extraction system, the words “Amsterdam” and “London” tend to appear in addresses in the “city” position. Deep Learning sifts through large amounts of data and produces word representations that cluster these similar words together. When a new word is found whose representation is similar to those of Amsterdam and London, it is likely to be a city. In this way, new knowledge can be inferred from raw data.
A representation of the name (red) and address (black) words from 4 CVs. The plot is a 3D projection of the word representations inferred using Deep Learning. Note how first names and parts of British postcodes (e.g. 1XA) each tend to cluster together.
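The idea that similar words occur in similar contexts can be illustrated with a very small sketch. The code below is not Textkernel's model and involves no Deep Learning: it is a simple count-based stand-in that represents each word by the words found near it and compares those representations with cosine similarity. The corpus, the function names and the window size are all assumptions made up for this illustration.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus invented for illustration (not real CV data).
SENTENCES = [
    "address 12 main street amsterdam netherlands",
    "address 5 high street london uk",
    "lives in amsterdam since 2010",
    "lives in london since 2012",
    "skills python and java programming",
    "skills java and python development",
]


def context_vectors(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vectors = context_vectors(SENTENCES)
city_similarity = cosine(vectors["amsterdam"], vectors["london"])
skill_similarity = cosine(vectors["amsterdam"], vectors["python"])

print("amsterdam vs london:", round(city_similarity, 2))
print("amsterdam vs python:", round(skill_similarity, 2))
```

On this toy corpus, “amsterdam” and “london” share context words such as “street” and “lives”, so their similarity is clearly higher than that of “amsterdam” and “python”, which share none. Learned representations capture the same effect at a much larger scale and in a dense, low-dimensional form.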
Increasing coverage and robustness
Deep Learning has allowed Textkernel to break free from the limitations of using human-annotated data in its ‘machine learning’ pipeline. Adding new knowledge used to be a time-consuming process. For example, a list of skills had to be gathered manually and then integrated into the pipeline. With Deep Learning, this process can be automated and implemented in a more systematic fashion. This new knowledge increases the robustness of Textkernel’s document understanding models, makes them more responsive to new words and increases their domain coverage.
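As a rough illustration of how a manually gathered list might be extended automatically, the sketch below takes a few seed skills and proposes unseen words whose vectors lie close to the seeds' average. This is an assumption about one possible approach, not a description of Textkernel's pipeline. The three-dimensional vectors, the `expand_seed_list` helper and the 0.95 threshold are hypothetical and chosen only to keep the example readable.

```python
from math import sqrt

# Hypothetical word vectors, hand-written for illustration only.
# Real learned representations have far more dimensions.
EMBEDDINGS = {
    "python": (0.9, 0.1, 0.0),
    "java": (0.8, 0.2, 0.1),
    "kotlin": (0.85, 0.15, 0.05),
    "amsterdam": (0.1, 0.9, 0.1),
    "london": (0.0, 0.95, 0.2),
}


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_seed_list(seeds, embeddings, threshold=0.95):
    """Return words outside `seeds` whose vector is close to the seeds' centroid."""
    dims = len(next(iter(embeddings.values())))
    centroid = tuple(
        sum(embeddings[seed][d] for seed in seeds) / len(seeds)
        for d in range(dims)
    )
    return sorted(
        word
        for word, vector in embeddings.items()
        if word not in seeds and cosine(vector, centroid) >= threshold
    )


new_skills = expand_seed_list({"python", "java"}, EMBEDDINGS)
print(new_skills)
```

With these made-up vectors, only “kotlin” is close enough to the seed skills to be proposed, while the city names are rejected. In practice the threshold would need tuning, and proposed words would likely still be reviewed before being trusted.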
Extract! 4.0: better parsing with Deep Learning
In November 2017, Textkernel released Extract! 4.0, the first-ever English parsing software fully based on Deep Learning. This release brought significant accuracy improvements to our parsing technology, with error rates decreasing by 15% to 30% on average! Ongoing development has brought Deep Learning to a range of new languages, including German, French, Dutch, Spanish, Russian and Simplified Chinese, some of which have seen error rates decrease by as much as 60%.
Get ready for a new era in HR Tech!