Posted on April 1, 2019

Sequence Labeling via Deep Learning – The magic behind Extract! 4.0

by Chao Li

The wave of Deep Learning

Textkernel has championed the use of Machine Learning to connect people to jobs faster and more easily. The fundamental components of this technology are our resume and vacancy parsing models, which extract the most important information from unstructured text. Recently we have achieved a remarkable improvement by applying Deep Learning to these models.

Deep Learning is the key technology powering many exciting novel applications, ranging from Google Translate to Siri and Alexa, and from Google Photos to Amazon Go stores. The current wave of Deep Learning is largely fuelled by the growth of computing power and recent advances in neural network algorithms. Thanks to this progress, problems that used to be the exclusive domain of human beings, e.g. speech recognition, machine translation, game playing, and even car driving, can now be tackled by computers, in some cases better than by people.

Since late 2017, Deep Learning has also been the technology powering Textkernel’s Extract! 4.0. In our new generation of resume and vacancy parsing products, the traditional statistical parsing models are replaced with Deep Learning neural network models. The new models have achieved remarkable improvements across different languages, generalize better to new data and new domains, and reduce the need for manual feature engineering.

The advantage of Deep Learning for sequence labeling

Accurate sequence labeling is the foundation for all Textkernel products, from Extract! and Jobfeed to Search! and Match!. Back in the days before the Deep Learning revolution, which was not very long ago, Hidden Markov Models (HMM) and Conditional Random Fields (CRF) were the best models for this problem. Both methods take a sequence of input instances and learn to predict an optimal sequence of labels. The difference is that an HMM works only on the word identity of each token, while a CRF works on a set of features defined on input tokens or phrases.

They are very powerful models, but a few drawbacks have limited their success. One of the most well-known weaknesses of the HMM and CRF is their lack of semantic awareness. For example, they treat “Madam” and “Rotterdam” as equally similar (or equally distant) to “Amsterdam”.

A direct consequence of this is their difficulty in generalizing to unseen data. Since the model is very often trained on a limited number of annotated documents, the performance over different domains may vary a lot depending on whether the domain is covered by the training data set. For example, a typical training set contains a limited set of common job titles or companies, and prediction is much more difficult when parsing resumes from a particular sector not represented in the training set, e.g. the offshore oil industry.

If you come from a Machine Learning background, you may say, “I know where you are going: word embeddings?”. Yes, a word embedding is a dense representation of a word, and a much more semantically rich one. In the clip below, we show a word embedding trained on many documents, where words belonging to similar concepts (job titles, cities, skills) are clustered together in space.

Although there is work (including our own from before the Deep Learning era) that experiments with embeddings as a CRF feature set, neural network models work directly on the word embeddings and generalize better to unseen data, i.e., new job titles can be identified even if they did not appear in the training data.
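To make the “Amsterdam” example concrete, here is a minimal sketch comparing a word-identity (one-hot) representation with a dense embedding. The three-dimensional embedding values are hand-picked for illustration and are not taken from any trained model; cosine similarity is assumed as the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["Amsterdam", "Rotterdam", "Madam"]

# Word-identity (one-hot) view: every word gets its own dimension.
one_hot = {w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
           for i, w in enumerate(vocab)}

# Hypothetical embeddings, hand-picked for illustration only.
embedding = {
    "Amsterdam": [0.9, 0.8, 0.1],
    "Rotterdam": [0.8, 0.9, 0.2],
    "Madam":     [0.1, 0.2, 0.9],
}

for name, table in (("one-hot", one_hot), ("embedding", embedding)):
    sim_city = cosine(table["Amsterdam"], table["Rotterdam"])
    sim_other = cosine(table["Amsterdam"], table["Madam"])
    print(f"{name}: Rotterdam={sim_city:.2f}, Madam={sim_other:.2f}")
```

In the one-hot view both similarities are zero, so the model has no way of telling that “Rotterdam” is the closer word. With the dense vectors the two cities end up much closer to each other than either is to “Madam”.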

Another drawback is that both HMM and CRF are Markov chain-based models. Because of their Markovian assumptions, they have trouble handling longer sequential dependencies, e.g., dependencies spanning more than three steps of the input sequence are often ignored. RNN (Recurrent Neural Network) models, on the other hand, are designed to capture local dependencies and to find longer patterns. LSTM, GRU, and other RNN variants with gate units improve on the standard recurrent unit. Their gates allow the network to pass or block information from one time step to the next, so that information can be kept around over even longer sequences. The gated units also reduce the effect of the vanishing gradient problem on long input sequences during training.
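The gating idea can be sketched as a single LSTM time step in plain Python. This follows the standard textbook LSTM equations and is not Textkernel's implementation; the helper names (`lstm_step`, `random_params`), the sizes, and the random weights are assumptions made for the example.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Compute W @ v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps a gate name to (W, b) acting on [x; h_prev]."""
    z = x + h_prev  # list concatenation of the input and the previous hidden state
    i = [sigmoid(a) for a in affine(*params["input"], z)]   # how much new info to write
    f = [sigmoid(a) for a in affine(*params["forget"], z)]  # how much old memory to keep
    o = [sigmoid(a) for a in affine(*params["output"], z)]  # how much memory to expose
    g = [math.tanh(a) for a in affine(*params["cand"], z)]  # candidate memory content
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def random_params(input_size, hidden_size, seed=0):
    """Random, untrained weights; a real model learns these from data."""
    rng = random.Random(seed)
    def gate():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(input_size + hidden_size)]
             for _ in range(hidden_size)]
        return W, [0.0] * hidden_size
    return {name: gate() for name in ("input", "forget", "output", "cand")}

params = random_params(input_size=3, hidden_size=2)
h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state `c` is carried over almost unchanged, which is how information can survive across many time steps.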

A large part of CRF model development is finding a set of high-quality features. Machine learning engineers are needed for handcrafting and tuning features. For example, adding a list of typical section headers can greatly improve the model performance. This dependence on handcrafted features poses a very difficult problem when expanding the CV parsing service to new languages.
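To give a feel for what handcrafting features involves, the sketch below builds a feature dictionary per token, in the style commonly fed to CRF toolkits. The feature names and the `SECTION_HEADERS` list are hypothetical examples, not the features used in our models.

```python
# Hypothetical list of section headers; a real system would curate one per language.
SECTION_HEADERS = {"education", "work experience", "skills", "references"}

def token_features(tokens, i):
    """Build a handcrafted feature dict for tokens[i]."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "suffix3": word[-3:].lower(),
        "is_section_header": word.lower() in SECTION_HEADERS,
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

tokens = ["Education", "MSc", "Computer", "Science", "2012"]
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[0])
```

Every entry here is a design decision that has to be revisited for each new language: the header list needs translating, and capitalization or suffix cues may not carry over at all.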

In contrast, deep neural nets have shown great power to learn latent features. The training process jointly learns the most representative features and the best model given these features. This is crucial for model development: it dramatically decreases the development time by removing the work of handcrafting features.

Textkernel’s Deep Learning model

After experimenting with different RNN units, we chose the popular LSTM units to build our sequence labeling network. The input of the model is a sequence of tokens, which goes through the following layers and components:

  • It starts with a one-hot lookup layer that finds each word in the embedding space, turns the word into a dense vector, and feeds it into a multi-layer Bidirectional LSTM model.
  • The Bidirectional LSTM is actually two separate neural network layers: one reads the data from the beginning to the end, and the other from the end to the beginning. Both layers generate a representation vector for every step, and the two vectors are joined to create a better representation of context.
  • After one or more Bidirectional LSTM layers, the probabilities of the possible classes for each input entity are computed using a softmax layer.
  • To have better consistency in the predicted tag sequence, the softmax probabilities are combined with the transition probabilities from a linear CRF layer. In other words, instead of predicting each label independently, the CRF layer takes into account the labels of surrounding words as well.
  • At the last layer, a prediction is made using the scores from both the softmax layer and the CRF layer.
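The last two steps of the list can be illustrated with a small Viterbi decoder that combines per-token scores with label-to-label transition scores. This is a sketch of the general technique, not our production code: the label set, the probabilities, and the function name `viterbi` are made up for the example, and a real CRF layer learns its transition scores during training.

```python
import math

def viterbi(emissions, transitions, labels):
    """Return the best label sequence given per-token and label-to-label log scores.

    emissions[t][y]   : log score of label y at position t (e.g. log-softmax output)
    transitions[p][y] : log score of moving from label p to label y
    """
    # best[y] is the score of the best path ending in label y at the current position.
    best = {y: emissions[0][y] for y in labels}
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + transitions[p][y])
            new_best[y] = best[prev] + transitions[prev][y] + emissions[t][y]
            pointers[y] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda y: best[y])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))

def to_log(table):
    return {k: math.log(v) for k, v in table.items()}

labels = ["O", "B-JOB", "I-JOB"]
# Hypothetical softmax outputs for the tokens "senior software engineer".
emissions = [to_log(row) for row in (
    {"O": 0.30, "B-JOB": 0.60, "I-JOB": 0.10},
    {"O": 0.45, "B-JOB": 0.15, "I-JOB": 0.40},
    {"O": 0.20, "B-JOB": 0.10, "I-JOB": 0.70},
)]
# Hypothetical transition probabilities; "O" followed by "I-JOB" is nearly forbidden.
transitions = {
    "O":     to_log({"O": 0.70, "B-JOB": 0.29, "I-JOB": 0.01}),
    "B-JOB": to_log({"O": 0.30, "B-JOB": 0.10, "I-JOB": 0.60}),
    "I-JOB": to_log({"O": 0.40, "B-JOB": 0.10, "I-JOB": 0.50}),
}

greedy = [max(labels, key=lambda y: row[y]) for row in emissions]
print("independent:", greedy)
print("with transitions:", viterbi(emissions, transitions, labels))
```

Picking each label independently produces an “I-JOB” that directly follows an “O”, which is not a valid tag sequence. Taking the transitions into account yields one consistent job title span instead.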

The model is designed to be flexible enough to work on different sequence labeling problems. For example, when the working entity is a phrase, i.e. a sentence or a line, the model can generate a phrase representation to feed to the network and label a sequence of phrases. In this case, a Convolutional Neural Network (CNN) is applied to combine the embeddings of all tokens in the phrase into one.
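One common way to turn a variable number of token embeddings into a single fixed-size phrase vector is a one-dimensional convolution followed by max-over-time pooling. The sketch below assumes that setup, together with a ReLU activation and zero padding for short phrases; the post does not specify which CNN variant is used, and all numbers are illustrative.

```python
def conv1d_max_pool(token_vectors, filters, width=2):
    """Combine token embeddings into one phrase vector.

    Each filter is a flat weight list of length width * embedding_dim. It slides
    over every window of `width` consecutive tokens; the maximum ReLU response
    over all windows becomes one dimension of the phrase vector.
    """
    dim = len(token_vectors[0])
    # Pad short phrases with zero vectors so that at least one window exists.
    padded = token_vectors + [[0.0] * dim] * max(0, width - len(token_vectors))
    windows = [sum(padded[i:i + width], []) for i in range(len(padded) - width + 1)]
    phrase = []
    for weights in filters:
        responses = [max(0.0, sum(w * x for w, x in zip(weights, win)))
                     for win in windows]
        phrase.append(max(responses))
    return phrase

# Hypothetical 2-dimensional token embeddings for a three-token line.
line = [[0.2, -0.1], [0.7, 0.3], [-0.4, 0.9]]
# Three hypothetical filters, each covering 2 tokens * 2 dimensions = 4 weights.
filters = [[0.5, 0.1, -0.3, 0.8], [-0.2, 0.4, 0.6, 0.1], [0.3, 0.3, 0.3, 0.3]]

print(conv1d_max_pool(line, filters))
```

The output has one value per filter, regardless of how many tokens the phrase contains, so lines of different lengths can all be fed to the same sequence labeling network.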

A word embedding is normally trained on a very large corpus, which leaves only a very small portion of infrequent words unknown to the model. However, the proportion of unknown words increases dramatically for morphologically rich languages and compounding languages, and this can have a big impact on the model’s performance. For these languages we use subword embeddings. The idea is similar to how human beings guess the meaning of a word from its stem or subunits when they see it for the first time. In a subword embedding, the embedding of a word is the sum of the embeddings of its sub-units. Because any word can be split into sub-units, subword embeddings effectively eliminate unknown words, and they bring a substantial improvement for those languages.
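The summing step can be sketched as follows. The choice of character n-grams as sub-units, the hashing into a fixed number of buckets, and the random table values are all assumptions for illustration; in a real model the sub-unit vectors are learned.

```python
import random
import zlib

DIM = 4
NUM_BUCKETS = 1000

# Hypothetical subword embedding table; in practice these vectors are learned.
_rng = random.Random(0)
TABLE = [[_rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of the word wrapped in boundary markers '<' and '>'."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def subword_embedding(word):
    """Sum the vectors of all sub-units; also works for words never seen in training."""
    vector = [0.0] * DIM
    for gram in char_ngrams(word.lower()):
        row = TABLE[zlib.crc32(gram.encode("utf-8")) % NUM_BUCKETS]
        vector = [v + r for v, r in zip(vector, row)]
    return vector

# A Dutch compound and one of its parts share many sub-units.
shared = set(char_ngrams("softwareontwikkelaar")) & set(char_ngrams("ontwikkelaar"))
print(len(shared), "shared n-grams")
print(subword_embedding("softwareontwikkelaar"))
```

Because a compound such as “softwareontwikkelaar” shares most of its n-grams with “ontwikkelaar”, its vector is built largely from sub-units the model has already seen, even if the compound itself never occurred in the training corpus.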

Just as the human brain needs experience to learn and deduce information, a deep neural network model may contain anywhere from hundreds to billions of parameters, which need to be learned progressively from the training data. The training process includes both determining the model complexity (number of layers, size of hidden units, etc.) and finding the best set of model parameters (weights, biases) given that complexity. Our R&D team has built a toolset to prepare training data, train and evaluate models, and integrate the best models back into the parsing pipeline. This allows us to easily conduct experiments on new datasets and languages, and to quickly roll out new or better models.

Deep Learning result

The Deep Learning model improved accuracy by a big leap, easily exceeding the performance of the old statistical models. Below we compare the new Deep Learning model with the old statistical models for English resume parsing. The Deep Learning model shows a 10 to 20% relative improvement on education fields, and 15 to 35% on experience fields. The same trends are observed when applying Deep Learning to other languages.

With this result, Deep Learning again proved that it is an incredibly powerful set of techniques for NLP problems. For Textkernel, a new door has opened. We have lots of ideas on how to further improve our parsing models and many new directions to explore. More research work and exciting improvements are on the way. Stay tuned!