For Jobfeed, Textkernel’s real-time database for labor market analysis, we crawl the web for online job ads. Of course, ensuring the high quality of this database raises many technical challenges. One of them is the detecting duplicates of job ads. In this blog post Valentin Jijkoun, Web Mining team lead, explains how Textkernel solves this problem.
By Valentin Jijkoun
The average online job posting has many duplicates
When you work with large volumes of web data, sooner or later you encounter the problem of duplicate content. Many web pages are not unique: a Google paper suggests that around 30% of web pages are near-duplicates.
In Jobfeed, where we spider online job ads in eight countries, this issue is even more prominent: we found out that an average job ad is reposted 2 to 5 times (depending on the country), which makes the fraction of duplicates as high as 50–80%. This is not surprising: companies advertise their jobs as widely as possible to attract more good applicants: on corporate sites, on general and specialised job boards, via staffing agencies, etc. This makes identifying and grouping duplicate ads an essential step in our data processing.
Near-duplicate jobs ads are rarely identical: when posting an ad on a different site, text is often restructured, shortened or extended and metadata is often changed to match the site’s data model. Consider two postings of the the same job shown below:
The statement about equal opportunity employment from aerotek.com is moved to the top of the posting on thingamajob.com. The metadata has different fields (“Company:”, “Category:”) and the values are written out differently. The full address of the company is added to the posting on thingamajob.com. The boilerplate text (page headers, footers, navigation blocks) is website-specific as well.
Nevertheless, how can we tell these two ads refer to the same job?
- Aerotek (a staffing agency) mentions “Posting ID” which happens to be the same as the “Job reference ID” on thingamajob.com (a job board).
- The job titles are identical and the job descriptions are very similar.
- The name of the job advertiser is similar: Aerotek vs. Aerotek Scientific.
- Similar contract type, salary range and job location.
- The same contact person (the name, the phone number and the email).
These observations make it very likely we are dealing with the same vacancy. However, none of the observations on its own would be sufficient. Consider two identical job ads where only the location of the job (e.g., the city name, a single word!) is different: these ads are clearly not duplicates and should be treated as separate vacancies.
In Jobfeed, we approach this as a machine learning problem: we train a statistical classifier that predicts whether two postings refer to the same job based on a number of features (such as the “observations” above).
For a large country like the USA, we spider hundreds of thousands new job postings every day. Since it would be impractical to compare every new posting to each of the millions postings we have seen before, we restrict the comparison to a subset of textually similar ads. Identifying this subset is where traditional textual near-duplicate detection turns out to be very useful.
Efficient methods for finding similar web pages are almost as old as the web itself: a seminal paper Syntactic clustering of the web by A. Broder et al. appeared in 1997. In Jobfeed we use a classic approach based on shingling and min-wise permutations.
As we show below, a relatively simple implementation can easily deal with tens of millions of documents. For datasets with billions of documents, a more sophisticated implementation and/or a different method is needed: see the blog post Near-Duplicate Detection for an overview and the paper Detecting Near-Duplicates for Web Crawling for details. With shingling, we convert every text document into a set of all sequences (shingles) of consecutive words as they occur in the document.
For example, if we consider 6-word sequences for the text “Well established and respected Law Office in Downtown Bakersfield is in need of a temporary Legal Assistant …”, the shingles will include:
- “Well established and respected Law Office”
- “established and respected Law Office in”
- “and respected Law Office in Downtown” etc.
We can measure textual similarity between two documents as the overlap between their sets of shingles. For our example “Legal secretary” ads, the two documents share 179 out of total 484 shingles between them, which gives us the similarity of 37%. It may seem low, given that the job descriptions are nearly identical: it is mainly due to the different formatting of metadata (location, company name, phone) and missing text on equal opportunity employment. It is clear that for job ads, using a simple similarity threshold to decide whether two ads are duplicates would not be sufficient.
Finding similar documents efficiently
To avoid storing and comparing hundreds of shingles for every document, we apply a locality sensitive hashing scheme as follows. We first map each shingle of a document to a 64-bit integer using a good hash function. Let’s call these integers shingle hashes. We take a fixed random integer permutation, apply it to all shingle hashes from the document and take the smallest value – this is the first value we will store. We repeat this M times with different fixed random permutations, to get a list of M integers: we call this the sketch of the document. It turns out that in order to estimate similarity of two documents we only need to count the number of matching integers in their sketches. This means we only need to store M integers for every document. By selecting appropriate value for M we can find a good trade-off between the accuracy of the estimation and the storage requirements.
How do we store document sketches for quick lookup of similar documents? We use a common technique: an inverted index. We compute sketches for all documents in our dataset, and for every integer we observe, we store a list of documents whose sketches contain this integer. To find similar documents for a new input, we only need to merge document lists for the M integers in its sketch. Time- and memory-efficient management of such indices is what Elasticsearch is very good at, so we use it to index our sketches and retrieve possible near-duplicates. We update the index continuously as new documents are spidered in Jobfeed.
Specifics for job ad deduplications
As I discussed above, we use shingling, min-wise permutation hashing and inverted indexing to find job ads textually similar to a new input document. For general purpose near-duplicate detection, this is often all that is needed: a simple threshold on text similarity between the input and the candidates is enough to decide whether the input is a duplicate. For job ads such threshold-based approach is not sufficient: as we have seen in our example, true duplicates can have similarity as low as 37%.
In Jobfeed, we complement text-based retrieval with a few more tricks:
- We apply machine learning and rule-based techniques to remove irrelevant content (boilerplate text, navigation, banners, ads, etc.)
- We use our semantic parser trained specifically for job ads to identify text sections containing job description and candidate requirements (as opposed to the general description of the advertising company, which is often repeated on many unrelated job ads of the company)
- For every new job ad, we use shingling as described above to find candidates with substantial text overlap.
- We train a classifier to predict whether a given input and a retrieved candidate are indeed postings of the same job. The classifier relies on the text overlap and the match between other properties of the postings, as extracted by our semantic parser: organisation name, profession category, contact information, etc.
How accurate is the system I outlined above? Does it find all duplicates? Does it cluster together postings that are not duplicates? A short answer would be that the system is “pretty good, 90%-ish”. A longer answer would require a discussion of many possible ways to evaluate such a system – something I will do in a follow-up post.
Some useful links