How Jobfeed by Textkernel works
In 2003, Textkernel started aggregating job vacancy information for matching and analytical purposes under the label ‘Jobfeed’ in The Netherlands. Today, Textkernel is market leader in this domain in the Netherlands and has been expanding Jobfeed into other countries, such as Germany, France and the US. Due to its strong technological base and domain knowledge, Textkernel created a unique source of job market data, allowing users to gain insight into the demand side of the labour market.
The unique aspects of Jobfeed are:
- a very large number of sources (thousands of websites) that are spidered daily
- detailed enrichment on the job information that allows the use of many search criteria, regardless of the structure of the original vacancy text
- a high quality and reliable discovery and extraction process, resulting from years of experience
- accurate deduplication of job postings
- coding of professions and other criteria to customer-specific taxonomies
- customised reporting
- an unprecedented history of job data for analysis purposes and the capacity to make these jobs analysable with new insights
Jobfeed provides the ability to draw a near real-time picture of the labour market, and creates the opportunity to do trend analysis based on historic information from its large job database.
The Jobfeed process
Jobfeed searches the Internet daily for new jobs via an automated process. Found jobs are automatically extracted, categorised and recorded in the Jobfeed database. The following diagram shows a schematic representation of this process.
In more detail, the Jobfeed process consists of the following modules:
Jobfeed obtains new jobs from the Internet daily through spidering. In order to achieve broad and deep coverage, Jobfeed uses two spider methods: wild spidering and targeted spidering.
The wild spider is a system that works automatically and dynamically. It continuously indexes hundreds of thousands of relevant (company) websites and discovers new job postings.
Targeted spider scripts are created to retrieve jobs from specific – usually large – websites, like job boards, and websites of large employers. Despite their size and complexity, the script ensures that all jobs are found. The targeted spider scripts run multiple times per day.
Additionally, Jobfeed searches Twitter for links to jobs (currently only in The Netherlands).
Job sites that only copy or repost jobs from other sites (so-called aggregators), are excluded from Jobfeed, because Jobfeed already indexes the original jobs. Furthermore, aggregators often lose or misinterpret important information from the original job, resulting in bad quality.
Classification involves determining whether a retrieved web page contains a job or not. By means of advanced language technology and using textual features on the page, Textkernel’s algorithm determines whether the page should be processed. The classification is tuned to accept as many jobs as possible while discarding as many pages as possible that are not jobs.
Classification is only needed for pages coming from the wild spider, because the targeted spider scripts only fetch pages that are known to contain jobs.
In order to make the jobs searchable, they are automatically structured by means of Textkernel’s intelligent information extraction software. This software is trained on finding data in free text and is therefore independent of the structure of the text or format of the source.
The extraction process consists of two steps:
- Cleaning the web page, by removing all non-relevant content (such as menus and forms), leaving only the actual job text. In case of PDF input, this step does not apply.
- Extracting and validating more than 30 fields from the text, such as job title, location, education level and organisation.
Normalisation and enrichment
Normalisation means that extracted data is categorised according to a standard format. This makes it easier to search the data and perform analyses. Normalisation takes place on fields like professions, education levels and organisations.
For example, normalisation of professions is done by means of a taxonomy. This is a hierarchal structure that consists of reference professions with synonyms. The extracted position title is matched to one of the synonyms. The match does not need to be exact. The job will be linked to the most similar profession. When searching for jobs for a certain profession, all jobs matching any of the synonyms for that profession will be found.
Enrichment is done in case of organisations. The extracted contact information from the job is used to find the corresponding record in a national company database (such as Chamber of Commerce table in The Netherlands). Because the information in the job is usually sparse, a technique called “fuzzy matching” is used. Using this technique, Jobfeed can find the right organisation, regardless of differences in spelling of organisation name, address or in case of incomplete information. From the company database other information can be derived, such as the amount of employees, the company’s primary activity and its full contact information.
Jobs are often posted on multiple websites, or multiple times on the same website. Deduplication is done by comparing a new job with all jobs that have been found by Jobfeed in the past six weeks.
Two job postings that are duplicates of each other, are usually not identical. Deduplication therefore requires a sophisticated approach. As with the classification and extraction, the deduplication system uses a machine learning algorithm. To determine whether two jobs are duplicates of each other, the job description and important features of the job are compared, such as job title, city and advertiser.
Duplicates are not discarded, but also saved in Jobfeed. This way, Jobfeed is able to show not only how many unique jobs there are, but also how many job postings there have been.
Each job posting’s original source is regularly revisited to check whether it is still active. “Expired” means that the job is no longer directly available from the original URL, or that the job is no longer retrievable by a normal user from the homepage of the original website. The expiration date is stored in the Jobfeed database.
Automatic processes for spidering, extraction, classification and normalisation are the only cost-effective way to realise a maximum potential from online job data. However, these processes are not error-free. The quality of Jobfeed data is continuously monitored and improved. This is done using a combination of automatic alerting and manual quality checks.
For more information about Jobfeed, contact Textkernel via firstname.lastname@example.org.
Jobfeed is a product of Textkernel BV. Textkernel specialises in machine intelligence for people and jobs, providing recruiting tools to accelerate the process of matching demand and supply in the job market: multi-lingual resume parsing, job parsing and semantic searching, sourcing and matching software.
The company was founded in 2001 as a private commercial R&D spin-off of research in natural language processing and machine learning at the universities of Tilburg, Antwerp and Amsterdam. Textkernel now operates internationally as one of the market leaders in its segment.