By Karlijn Dinnissen, Research Engineer in Textkernel’s R&D team and Chao Li, Team Lead in Textkernel’s R&D team
If you are using Jobfeed for lead generation or labor market analytics, chances are that it matters to you which type of organization is posting the jobs. Therefore, Jobfeed distinguishes between two advertiser types: direct employers and staffing/recruitment agencies.
The challenge is that the advertiser type is typically implicit in a job description. In the best case, the advertiser literally states “we are a staffing agency”, but even then there are countless ways to phrase the same thing. Therefore, we need to infer the advertiser type ourselves.
We developed a new multi-step Deep Learning AI system that first classifies whether a job posting comes from a direct employer or a staffing agency. It then uses all posting-level classifications to infer whether an organization is a direct employer or an agency. This approach results not only in much better accuracy but also in better consistency across postings from the same organization. Read on to learn how we did it.
Our previous method
Because we knew how important it is to know the advertiser type for each company, we started building a knowledge base of organizations and their types from the very start of Jobfeed. In the beginning, this was all done manually by reading the job descriptions or researching the advertiser.
But the bigger Jobfeed grew, the more new organizations we found, and maintaining the knowledge base manually simply became unsustainable. We added straightforward logic to get automatic ‘staffing agency’ signals from job postings, of which the most effective was pattern matching: staffing agencies typically use very similar ways to describe themselves and the job they are posting.
For example, these types of phrases may look very familiar to you:
- “For one of our clients, we are looking for a …”
- “A big player in the industry is recruiting a …”
- “We act as an employment agency for permanent recruitment for …”
And if you see these organization names, what do you think their advertiser type is?
- People Recruitment
- Staffing 123
- Amsterdam Resourcing
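To make the pattern-matching idea concrete, here is a minimal sketch in Python. The patterns and the function name `agency_signals` are our own illustrations, loosely based on the example phrases and names above. They are not the actual rules used in Jobfeed.

```python
import re
from typing import List

# Hypothetical text patterns, loosely based on the example phrases above.
AGENCY_TEXT_PATTERNS = [
    re.compile(r"\bfor one of our clients\b", re.IGNORECASE),
    re.compile(r"\bwe act as an employment agency\b", re.IGNORECASE),
]

# Hypothetical name pattern: words that often appear in agency names.
AGENCY_NAME_PATTERN = re.compile(
    r"\b(recruitment|staffing|resourcing)\b", re.IGNORECASE
)


def agency_signals(org_name: str, job_text: str) -> List[str]:
    """Return which rule-based 'staffing agency' signals fire for a posting."""
    signals = []
    if AGENCY_NAME_PATTERN.search(org_name):
        signals.append("name")
    if any(pattern.search(job_text) for pattern in AGENCY_TEXT_PATTERNS):
        signals.append("text")
    return signals


print(agency_signals("People Recruitment",
                     "For one of our clients, we are looking for a nurse."))
print(agency_signals("Acme Bakery", "We are looking for a baker."))
```

Rules like these are easy to write, but every new phrasing needs a new pattern, which is exactly the maintenance problem described here.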
To our knowledge, our competition uses a similar approach. The quality achieved is good but for such an important field, good is not enough. Every mis-tagged advertiser can be a major annoyance for users. We needed a better, more scalable solution to this problem.
Deep Learning for text classification
Textkernel has nearly 20 years of experience in applying state-of-the-art Machine Learning to the Recruitment domain. Therefore, it made sense to also apply our expertise to classify advertisers automatically.
As a job always comes from one of the two advertiser types, we are dealing with a binary text classification task, which is a common task in the Natural Language Processing (NLP) research field. It lends itself especially well to a Deep Learning classification model.
Deep Learning is an advanced technique that can automatically discover patterns from large amounts of data. We have used it successfully in our Extract parsing models since 2017 (more details on the approach here) and it has resulted in great performance improvements. It was time to apply it to a new problem. In our case, this means that instead of thinking of good patterns that indicate a certain type of advertiser ourselves, we can let a Deep Learning classifier do the job for us. This model is very likely to find a lot more useful patterns, even patterns that a human would never think of.
In order to come up with all those patterns, the model needs to see as many examples of job texts from both advertiser types as possible (supervised learning). Normally this can be a big bottleneck: every text you use to train the model needs to have a label (“staffing agency” or “direct employer”), and you can imagine that manually annotating 100,000 job postings this way would take a long time. But in our case we had our big knowledge base and historical data: we could use all Jobfeed jobs from every organization we had categorised throughout the years as training data. No need to annotate anything!
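The labeling shortcut described above can be sketched as follows. This is an illustration under our own assumptions: the knowledge base is modeled as a plain dictionary from organization name to advertiser type, and `build_training_set` is a hypothetical helper, not part of Jobfeed.

```python
from typing import Dict, List, Tuple

# Hypothetical knowledge base: organization name -> advertiser type.
KNOWLEDGE_BASE: Dict[str, str] = {
    "People Recruitment": "staffing agency",
    "Acme Bakery": "direct employer",
}


def build_training_set(
    postings: List[Tuple[str, str]],
    knowledge_base: Dict[str, str],
) -> List[Tuple[str, str]]:
    """Turn (organization, job text) pairs into (job text, label) examples.

    Each posting inherits the label of its organization. Postings from
    organizations that are not in the knowledge base are skipped.
    """
    examples = []
    for org_name, job_text in postings:
        label = knowledge_base.get(org_name)
        if label is not None:
            examples.append((job_text, label))
    return examples


postings = [
    ("People Recruitment", "For one of our clients, we are looking for a nurse."),
    ("Acme Bakery", "We are looking for a baker to join our team."),
    ("Unknown Org", "Driver wanted."),
]
print(build_training_set(postings, KNOWLEDGE_BASE))
```

The trade-off of this shortcut is that labels are assigned per organization rather than per posting, so an individual posting may carry few or no textual signals of its label.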
A typical Machine Learning model training process (blue) and prediction process (green)
We started by training a CNN (Convolutional Neural Network) classifier using English job postings from multiple countries. To make sure the new system is future proof, we evaluated its quality by manually checking the classifier’s output on a sample of jobs from organizations that did not yet exist in our knowledge base. Our new method found significantly more staffing agencies than the old rule-based system. This means that we are able to identify many new staffing agency postings within the long tail of organizations for which we had no prior manual classification.
This convinced us that Deep Learning was the correct path towards solving this problem, so we invested more time into optimizing our training process, collecting more data from more organizations and Jobfeed countries, optimizing the model’s hyperparameters, and finally we also trained models for all other Jobfeed languages (Dutch, German, French, Italian and Spanish).
Ensuring consistency: a second Deep Learning model
Once we enabled our new classifier in Jobfeed, there was already a big increase in advertiser type quality throughout all countries. There was, however, one caveat: not all job postings from an organization necessarily contain the same type of signals. Therefore, there is a chance that certain organizations will have 90-95% of their jobs classified as one advertiser type and 5-10% as the other.
We wanted to make sure all jobs from an organization would have the same advertiser type, to keep our data consistent. The most logical solution was to use the posting-level classifications to infer new knowledge on organization level.
We created a process that regularly aggregates the job postings from one organization, and uses the individual advertiser type classifications to infer whether the organization is a direct employer or staffing agency. If the final prediction is certain enough, we can even update our knowledge base automatically! The threshold we use for ‘certain enough’ can be different per language model and therefore country, which we kept in mind while designing the process.
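A per-language ‘certain enough’ check could look roughly like the sketch below. The threshold values, the language codes and the function name `knowledge_base_update` are all assumptions made for illustration. The real thresholds are not published in this post.

```python
from typing import Dict, Optional, Tuple

# Hypothetical thresholds per language model; the real values are not published.
CONFIDENCE_THRESHOLDS: Dict[str, float] = {"en": 0.95, "nl": 0.97}
# Conservative fallback for languages without a tuned threshold (an assumption).
DEFAULT_THRESHOLD = 0.99


def knowledge_base_update(
    org_name: str,
    predicted_type: str,
    confidence: float,
    language: str,
) -> Optional[Tuple[str, str]]:
    """Return a (organization, type) update if the prediction is certain enough.

    Returns None when the confidence is below the threshold for the language,
    in which case the organization stays unclassified for now.
    """
    threshold = CONFIDENCE_THRESHOLDS.get(language, DEFAULT_THRESHOLD)
    if confidence >= threshold:
        return (org_name, predicted_type)
    return None


print(knowledge_base_update("Staffing 123", "staffing agency", 0.96, "en"))
print(knowledge_base_update("Staffing 123", "staffing agency", 0.96, "nl"))
```

Keeping the thresholds in a lookup table rather than in code makes it straightforward to tune them separately for each language model.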
A naive approach was to simply add up counts per organization and take the most frequent advertiser type (e.g. 20 staffing agency postings & 5 direct employer postings ⇒ staffing agency). However, this did not give us the accuracy and yield that we needed. Therefore, we created another Deep Learning model that makes the final decision based on the output from the first one.
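For reference, the naive counting approach fits in a few lines of Python. The function name `majority_type` is ours, and the vote share it returns is an addition for illustration.

```python
from collections import Counter
from typing import List, Tuple


def majority_type(posting_labels: List[str]) -> Tuple[str, float]:
    """Return the most frequent advertiser type and its share of the postings."""
    if not posting_labels:
        raise ValueError("need at least one posting-level classification")
    counts = Counter(posting_labels)
    top_label, top_count = counts.most_common(1)[0]
    return top_label, top_count / len(posting_labels)


# The example from the text: 20 staffing agency postings and 5 direct employer postings.
labels = ["staffing agency"] * 20 + ["direct employer"] * 5
print(majority_type(labels))
```

One limitation that is easy to see in this sketch: a simple count ignores how confident each posting-level prediction was, and an organization with a single posting always gets a 100% vote share.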
Its input consists of statistical features derived from predictions on all postings from one organization. In addition, we also used the organization name and the website on which the job was posted. An added benefit of using mostly statistical features was that we could train a language-agnostic model which can be applied to any organization’s posting-level output, regardless of country or language.
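To give an impression of what such statistical features might look like, here is a minimal sketch. The exact features we use are not listed in this post, so the feature names below, the 0.5 cut-off and the function `organization_features` are assumptions. The organization name and website inputs are left out.

```python
from statistics import mean, pstdev
from typing import Dict, List


def organization_features(agency_probs: List[float]) -> Dict[str, float]:
    """Summarize posting-level 'staffing agency' probabilities for one organization.

    agency_probs holds one probability per posting, as produced by a
    posting-level classifier. The feature set is hypothetical.
    """
    if not agency_probs:
        raise ValueError("need at least one posting-level prediction")
    n_postings = len(agency_probs)
    return {
        "n_postings": n_postings,
        "mean_agency_prob": mean(agency_probs),
        "std_agency_prob": pstdev(agency_probs),
        # Share of postings that lean towards 'staffing agency' (0.5 is an assumed cut-off).
        "share_agency": sum(p >= 0.5 for p in agency_probs) / n_postings,
        "min_agency_prob": min(agency_probs),
        "max_agency_prob": max(agency_probs),
    }


print(organization_features([0.9, 0.8, 0.2, 0.7]))
```

Because these features are computed from probabilities and counts rather than from words, they look the same for every language, which is what makes a language-agnostic second model possible.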
We again trained the model in a supervised way (using labeled training data) and evaluated it on data from unknown organizations. The results showed us that our classifier automatically identifies, with high confidence, between 55% and 85% of the new staffing agencies (depending on the country). Since our system runs at regular intervals to detect new staffing agencies, we noticed that its performance and confidence increase as new postings arrive from the as-yet unclassified staffing agencies. The more data our system sees, the better it gets.
As a result, since enabling the classifier we have seen a 20-50% increase in the number of staffing agencies identified in Jobfeed.
While initially we focused on identifying staffing agencies, we found that we could kill two birds with one stone. Not only could we use very high confidence staffing agency predictions to automatically identify agencies, but we could also use very high confidence predictions for the opposite category to identify direct employers. Therefore, we added automatic direct employer detection as well, further increasing our Jobfeed data quality and consistency.
This solution has already allowed us to identify with high confidence the advertiser type of over 55,000 new organizations across all countries. Since our process is organized as an iterative self-feeding loop, many more are being added continuously and automatically.
We are very excited that we have solved a major customer pain point by applying our experience in AI and leveraging our own data and knowledge. It also opens up new possibilities to improve other aspects of the data in Jobfeed. Our continued investment in the Jobfeed data will ensure you keep saving time and stay ahead of the competition.
To learn more about how companies benefit from Textkernel’s labor market intelligence, Jobfeed, please visit our Jobfeed pages.