Textkernel began in 2001 as a private, commercial R&D spin-off with a focus on research into Natural Language Processing and Machine Learning at the Universities of Tilburg, Antwerp and Amsterdam.
We take pride in our strong ties to the academic community. So it is with great excitement that our Head of Ontology, Panos Alexopoulos, has announced the early release of his book: “Semantic Modeling for Data – Avoiding Pitfalls and Breaking Dilemmas”. Published by O’Reilly, the book serves as a practical, pragmatic field guide for data practitioners who want to learn how semantic data modeling is applied in the real world.
We sat down with Panos to discuss a bit about how the concept of the book came about, and what readers can expect:
Avoiding Pitfalls and Breaking Dilemmas
“The book is about the broader topic of Semantic Data Modeling which is actually the task and problem of creating representations and structures of data in a way that the meaning of the data is explicit and commonly shared and understood by both systems and humans,” Panos explains.
“That’s a general challenge that information technology has, and especially now with AI technology in place, it’s important that meaning is understood in an explicit way by humans and machines. The book fills a gap in the literature and the market, especially when it comes to books for practitioners and professionals. There are several academic books describing how to build an ontology, what the underlying theory behind data semantics is, et cetera, but this information is usually sparse and scattered, either in papers or in presentations, so it’s never gathered together. What is lacking is the industry perspective, the perspective from the side of a practitioner – what it means to build, use and maintain these kinds of models in the real world, in organizations in the industry. My work here at Textkernel has been one of the key inspirations for the book; many of the things I’ve seen here, both positive and negative, have contributed to me being a better modeler and professional, and I wanted to share these experiences with the rest of the community. That’s how the book was born.”
The role of early feedback
“It’s always important when you write a book to get early feedback. What O’Reilly, my publisher, allows you to do is provide the book online – some raw and unedited content on the platform – so that users can see the book and share their opinions, give their feedback, find mistakes, and flag things that may be wrong or where they may want more information. That’s extremely useful feedback because in the end it’s all about removing ambiguity. And because this book is not addressed to only one community – it’s actually a wider community, and there are different sub-communities in the data world that don’t necessarily use the same terminology or have the same experiences – it’s important that all these sub-communities have an opportunity to say something about the book.”
“Semantic Modeling for Data” is expected to be published in November, and is currently available as an early release version.
Download our latest eBook to understand why Internal Mobility is becoming an important new tool for your talent management strategy.
Learn more about how Internal Mobility can
- Reduce rising talent acquisition costs
- Improve employee satisfaction and engagement
- Build an internal skills economy that meets shifting talent demands
Download the full eBook to find out more:
In 2003, Textkernel started aggregating job vacancy information for matching and analytical purposes under the label ‘Jobfeed’ in the Netherlands. Today, Jobfeed is available for Austria, Belgium, Canada, Germany, France, Italy, the Netherlands, Spain, the United Kingdom and the US, and Textkernel is the market leader in this domain. Thanks to its strong technological base and domain knowledge, Textkernel has created a unique source of job market data, allowing users to gain insight into the demand side of the labour market.
The unique aspects of Jobfeed:
- a very large number of sources (thousands of websites) that are spidered daily
- detailed enrichment on the job information that allows the use of many search criteria, regardless of the structure of the original vacancy text
- a high quality and reliable discovery and extraction process, resulting from years of experience
- accurate deduplication of job postings
- coding of professions and other criteria to customer-specific taxonomies
- customised reporting
- an unprecedented history of job data for analysis purposes, with the capacity to re-analyse these jobs as new insights emerge
Jobfeed provides the ability to draw a near real-time picture of the labour market, and creates the opportunity to do trend analysis based on historic information from its large job database.
The Jobfeed process
Jobfeed searches the Internet daily for new jobs via an automated process. Found jobs are automatically extracted, categorised and recorded in the Jobfeed database. The following diagram shows a schematic representation of this process.
In more detail, the Jobfeed process consists of the following modules:
Jobfeed obtains new jobs from the Internet daily through spidering. In order to achieve broad and deep coverage, Jobfeed uses two spider methods: wild spidering and targeted spidering.
The wild spider is a system that works automatically and dynamically. It continuously indexes hundreds of thousands of relevant (company) websites and discovers new job postings.
Targeted spider scripts are created to retrieve jobs from specific – usually large – websites, such as job boards and the websites of large employers. Despite the size and complexity of these sites, the scripts ensure that all jobs are found. The targeted spider scripts run multiple times per day.
Additionally, Jobfeed searches Twitter for links to jobs (currently only in The Netherlands).
Job sites that only copy or repost jobs from other sites (so-called aggregators), are excluded from Jobfeed, because Jobfeed already indexes the original jobs. Furthermore, aggregators often lose or misinterpret important information from the original job, resulting in bad quality.
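As a rough illustration of the spidering step above, a breadth-first crawl frontier can be sketched in a few lines. This is a minimal sketch only: the toy site graph stands in for real HTTP fetching and link extraction, and all URLs are invented.

```python
from collections import deque

# Toy site graph standing in for fetched pages; in a real spider this would
# be an HTTP client plus a link extractor (both assumptions here).
SITE = {
    "https://acme.example/jobs": ["https://acme.example/jobs/1",
                                  "https://acme.example/jobs/2"],
    "https://acme.example/jobs/1": [],
    "https://acme.example/jobs/2": ["https://acme.example/jobs/1"],
}

def crawl(seed, max_pages=100):
    """Breadth-first crawl from a seed URL, visiting each page at most once."""
    frontier = deque([seed])
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)                 # "fetch" the page
        for link in SITE.get(url, []):      # discover outgoing links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

pages = crawl("https://acme.example/jobs")
```

The `seen` set prevents re-fetching pages that are linked from multiple places, which matters at the scale of hundreds of thousands of sites.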
Classification involves determining whether a retrieved web page contains a job or not. By means of advanced language technology and using textual features on the page, Textkernel’s algorithm determines whether the page should be processed. The classification is tuned to accept as many jobs as possible while discarding as many non-job pages as possible.
Classification is only needed for pages coming from the wild spider, because the targeted spider scripts only fetch pages that are known to contain jobs.
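The recall-oriented tuning described above can be illustrated with a deliberately simple sketch: a keyword-scoring classifier with a low acceptance threshold, so that pages that might be jobs are kept at the cost of letting some non-job pages through. The cue words and the threshold are illustrative assumptions, not Textkernel's actual model, which is a trained language-technology classifier.

```python
import re

# Invented cue words that suggest a page is a job posting.
JOB_CUES = {"vacancy", "salary", "apply", "responsibilities",
            "requirements", "hiring"}

def looks_like_job(page_text, threshold=2):
    """Accept a page if it contains at least `threshold` job cue words.
    A low threshold favours recall: few real jobs are discarded."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return len(words & JOB_CUES) >= threshold

print(looks_like_job("Apply now! Vacancy for a data engineer. Salary negotiable."))  # True
print(looks_like_job("Read about our company history and values."))                  # False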
In order to make the jobs searchable, they are automatically structured by means of Textkernel’s intelligent information extraction software. This software is trained on finding data in free text and is therefore independent of the structure of the text or format of the source.
The extraction process consists of two steps:
- Cleaning the web page, by removing all non-relevant content (such as menus and forms), leaving only the actual job text. In case of PDF input, this step does not apply.
- Extracting and validating more than 30 fields from the text, such as job title, location, education level and organisation.
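A toy illustration of the second step is shown below. In reality the extraction is done by a trained model that is independent of the text's structure; this sketch instead uses invented label patterns over an already-cleaned text, purely to show the shape of the output (a record of named fields).

```python
import re

# Hypothetical patterns for a few of the 30+ extracted fields; real
# extraction does not rely on such fixed labels.
FIELD_PATTERNS = {
    "job_title": re.compile(r"^Job title:\s*(.+)$", re.MULTILINE),
    "location":  re.compile(r"^Location:\s*(.+)$", re.MULTILINE),
    "education": re.compile(r"^Education:\s*(.+)$", re.MULTILINE),
}

def extract_fields(cleaned_text):
    """Return a dict of whichever fields could be found in the text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(cleaned_text)
        if match:
            fields[name] = match.group(1).strip()
    return fields

sample = "Job title: Data Engineer\nLocation: Amsterdam\nEducation: MSc"
fields = extract_fields(sample)
```

Missing fields are simply absent from the result, so downstream steps can validate which of the expected fields were actually found.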
Normalisation and enrichment
Normalisation means that extracted data is categorised according to a standard format. This makes it easier to search the data and perform analyses. Normalisation takes place on fields like professions, education levels and organisations.
For example, normalisation of professions is done by means of a taxonomy. This is a hierarchical structure that consists of reference professions with synonyms. The extracted position title is matched to one of the synonyms. The match does not need to be exact: the job is linked to the most similar profession. When searching for jobs in a certain profession, all jobs matching any of the synonyms for that profession will be found.
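The inexact matching of a raw title to a taxonomy synonym can be sketched with a plain string-similarity measure. The two taxonomy entries and the similarity cutoff below are invented for illustration; a production taxonomy is far larger and the matching far more sophisticated.

```python
import difflib

# Toy taxonomy: reference professions mapped to their synonym lists.
TAXONOMY = {
    "Software Developer": ["software developer", "software engineer",
                           "programmer", "developer"],
    "Registered Nurse":   ["registered nurse", "nurse", "rn"],
}

def normalise_profession(extracted_title, cutoff=0.6):
    """Link a raw job title to the reference profession whose synonym
    is most similar; return None when nothing is similar enough."""
    title = extracted_title.lower()
    best_profession, best_score = None, 0.0
    for profession, synonyms in TAXONOMY.items():
        for synonym in synonyms:
            score = difflib.SequenceMatcher(None, title, synonym).ratio()
            if score > best_score:
                best_profession, best_score = profession, score
    return best_profession if best_score >= cutoff else None

print(normalise_profession("Sofware Engineer"))  # "Software Developer"
```

Note that the misspelled "Sofware Engineer" still lands on the right reference profession, which is exactly why the match "does not need to be exact".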
Enrichment is done for organisations. The extracted contact information from the job is used to find the corresponding record in a national company database (such as the Chamber of Commerce register in the Netherlands). Because the information in the job is usually sparse, a technique called “fuzzy matching” is used. Using this technique, Jobfeed can find the right organisation regardless of differences in the spelling of the organisation name or address, or incomplete information. From the company database other information can be derived, such as the number of employees, the company’s primary activity and its full contact information.
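A minimal sketch of this fuzzy record linkage: normalise the company names (drop punctuation and legal suffixes), then take the most similar database record above a cutoff. The two database records, the suffix list and the cutoff are invented; real matching also uses address fields and richer similarity measures.

```python
import difflib
import re

# Toy stand-in for a national company database; records are invented.
COMPANY_DB = [
    {"name": "Acme Holding B.V.", "city": "Amsterdam", "employees": 250},
    {"name": "Jansen Logistics",  "city": "Rotterdam", "employees": 40},
]

# Tokens to ignore when comparing names (illustrative, not exhaustive).
LEGAL_SUFFIXES = {"bv", "b", "v", "nv", "gmbh", "ltd", "inc"}

def _normalise(name):
    """Lower-case, strip punctuation, drop legal-form suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def match_company(extracted_name, cutoff=0.7):
    """Return the most similar database record, or None if no good match."""
    target = _normalise(extracted_name)
    best, best_score = None, 0.0
    for record in COMPANY_DB:
        score = difflib.SequenceMatcher(None, target,
                                        _normalise(record["name"])).ratio()
        if score > best_score:
            best, best_score = record, score
    return best if best_score >= cutoff else None
```

Once the record is found, the enrichment fields (employees, primary activity, full contact details) come directly from it rather than from the sparse job text.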
Jobs are often posted on multiple websites, or multiple times on the same website. Deduplication is done by comparing a new job with all jobs that have been found by Jobfeed in the past six weeks.
Two job postings that are duplicates of each other are usually not identical, so deduplication requires a sophisticated approach. As with classification and extraction, the deduplication system uses a machine learning algorithm. To determine whether two jobs are duplicates of each other, the job description and important features of the job are compared, such as job title, city and advertiser.
Duplicates are not discarded but are saved in Jobfeed as well. This way, Jobfeed can show not only how many unique jobs there are, but also how many job postings there have been.
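The comparison of key features plus description text can be sketched as a simple rule-based check. This is an assumption-laden stand-in for the machine learning algorithm described above: the field list, the exact-match rule and the 0.9 text-similarity threshold are all invented for illustration.

```python
import difflib

def is_duplicate(job_a, job_b, text_threshold=0.9):
    """Heuristic duplicate check: key fields must agree (case-insensitive)
    and the descriptions must be highly similar, since duplicate postings
    are rarely byte-identical."""
    for field in ("title", "city", "advertiser"):
        if job_a[field].lower() != job_b[field].lower():
            return False
    similarity = difflib.SequenceMatcher(
        None, job_a["description"], job_b["description"]).ratio()
    return similarity >= text_threshold

posting = {"title": "Data Engineer", "city": "Amsterdam",
           "advertiser": "Acme",
           "description": "We are looking for a data engineer to join our team."}
repost = dict(posting)  # the same ad found again on another site
print(is_duplicate(posting, repost))  # True
```

A learned model replaces the hand-set threshold with weights fitted to labelled duplicate pairs, but the inputs (field agreement and text similarity) are the same kind of signal.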
Each job posting’s original source is regularly revisited to check whether it is still active. “Expired” means that the job is no longer directly available from the original URL, or that the job is no longer retrievable by a normal user from the homepage of the original website. The expiration date is stored in the Jobfeed database.
Automatic processes for spidering, extraction, classification and normalisation are the only cost-effective way to realise the maximum potential of online job data. However, these processes are not error-free. The quality of Jobfeed data is therefore continuously monitored and improved, using a combination of automatic alerting and manual quality checks.
For more information about Jobfeed, contact Textkernel via email@example.com.
Jobfeed is a product of Textkernel BV. Textkernel specialises in machine intelligence for people and jobs, providing recruiting tools to accelerate the process of matching demand and supply in the job market: multi-lingual resume parsing, job parsing and semantic searching, sourcing and matching software.
The company was founded in 2001 as a private commercial R&D spin-off of research in natural language processing and machine learning at the universities of Tilburg, Antwerp and Amsterdam. Textkernel now operates internationally as one of the market leaders in its segment.