Scaling Semantic Search beyond 100 million CVs and jobs with Elasticsearch

Textkernel recently introduced version 3 of its products Search! and Match! with Elasticsearch as the new underlying search engine. Ruben Geerlings, lead developer of Search!, talks about the decision, as well as the challenges and solutions involved in making the switch.

by Ruben Geerlings

Releasing version 3 of Search!

First, let me briefly introduce the product: Search! is our semantic search software that allows our customers to find the right candidates and jobs in their own database and in external sources. We developed a rich user interface with unique features such as the nice-to-have, should-have, and must-have selector and the synonyms and related terms widget to perform complex search requests in a user-friendly manner. Search! uses semantic technology, based on natural language understanding, to interpret the user’s intent and transform this into a query that the underlying search engine understands.

We recently released version 3 of Search!, which marks the introduction of Elasticsearch as our new underlying search engine in order to offer out-of-the-box solutions for scalability and high availability.

Limitations of Search! 2

As adoption of Search! grew and the demand for scalability kept increasing, we realised that we had to switch our underlying search engine to support flexible scaling options. In practice, a single Search! 2 installation supports up to about 15 million CVs or jobs. We wanted to support more. Much more. After talking to several customers about their future scalability needs, we decided to design the new version of Search!, version 3, to scale to hundreds of millions of CVs and jobs.

From single node to cluster with Elasticsearch

The previous version of Search! presented a challenge because of its single-node architecture: a single index (a collection of searchable documents) was constrained to a single machine. Elasticsearch is an open source search engine built on top of Apache Lucene. Where Lucene provides a high-performance search engine for a single machine, Elasticsearch offers a distributed search engine that operates on a cluster of machines using a technique called sharding.

Sharding works by splitting up a large index into smaller indexes (called shards) which are individually powered by a single Lucene index. The shards can coexist on the same machine or be distributed over several machines (called cluster nodes). A search query is sent to all shards simultaneously and the results are combined to create the overall search result.

Example: suppose we want to search a collection of 100 million CVs. A single machine cannot hold all of that data and still answer queries quickly. By splitting the index into 10 shards of 10 million CVs each, we can distribute them over an Elasticsearch cluster of 10 machines that are queried simultaneously, yielding fast query responses. Moreover, by telling Elasticsearch to create replicas of the shards, we can continue querying the data even when several machines in the cluster go down.
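As a minimal sketch of the example above (the settings below are illustrative, not Textkernel's actual configuration), the 10-shard layout maps directly onto Elasticsearch index settings:

```python
# Hypothetical index settings for the 100-million-CV example:
# 10 primary shards, each a standalone Lucene index, plus one
# replica of every shard so the cluster keeps serving queries
# even when a node holding a primary goes down.
index_settings = {
    "settings": {
        "number_of_shards": 10,   # primaries: fixed at index creation
        "number_of_replicas": 1,  # copies: can be changed at any time
    }
}

def total_shard_count(settings: dict) -> int:
    """Primaries plus replicas that the cluster will allocate."""
    s = settings["settings"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])
```

With one replica per shard, the cluster allocates 20 shards in total, which Elasticsearch spreads over the available nodes automatically.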

Decision for Elasticsearch

Elasticsearch has been gaining popularity as a search engine because of its ease of deployment and developer-friendly API. It has a very active development community that frequently releases new versions. This, combined with the ease with which our customers would be able to scale out (a customer can start with a cluster of three Elasticsearch nodes and then incrementally add nodes as usage increases), led us to choose Elasticsearch as our new backend.

Porting semantic search features

Choosing Elasticsearch turned out to be the easy part; porting all of the semantic search features did not. What was initially estimated at two months' work took about six months to complete. Most of the effort went into transforming the semantic search query. Whereas a typical Elasticsearch query consists of a single-field keyword search, the queries generated by Search! combine multiple fields (such as location, job experience, education and skills), with different modalities (nice-to-have or must-have), take fuzzy similarities into account (such as searching with related terms or within the vicinity of a location), and apply several filters and score boosters on top. To achieve this we made heavy use of Elasticsearch's compound query constructs (bool and dis_max), nested several layers deep.
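To give a feel for this structure, here is a much-simplified sketch of how must-have and nice-to-have criteria can map onto bool and dis_max compound queries. The field names ("skills", "job_title", "city") and the query builder itself are hypothetical illustrations, not Search!'s actual schema or query generator:

```python
# Sketch: map must-have / nice-to-have search criteria onto
# Elasticsearch's bool and dis_max compound queries.
def build_query(must_have_skills, nice_to_have_terms, city=None):
    query = {
        "bool": {
            # must-have: every clause has to match or the CV is excluded
            "must": [{"match": {"skills": s}} for s in must_have_skills],
            # nice-to-have: matching only boosts the score; dis_max takes
            # the best-scoring variant among a term and its related terms
            "should": [
                {"dis_max": {"queries": [
                    {"match": {"job_title": t}} for t in terms
                ]}}
                for terms in nice_to_have_terms
            ],
        }
    }
    if city:
        # hard filter: restricts results without affecting the score
        query["bool"]["filter"] = [{"term": {"city": city}}]
    return query

# e.g. must have "java"; "developer" or "engineer" is nice to have
query = build_query(["java"], [["developer", "engineer"]], city="Amsterdam")
```

The real queries nest such constructs several layers deeper, but the principle is the same: modalities become bool clause types, and related-term expansion becomes a dis_max over the variants.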

“You cannot expect a query that returns good results for generic Web pages to return the best results when searching in CVs and jobs specifically”

Ranking and evaluation

A lot of research went into balancing the ranking formula. We needed to make sure that all aspects of the query contribute the desired amount to the final match score. It took many iterations until we got everything right. Elasticsearch has no domain knowledge. You cannot expect a query that returns good results for generic Web pages to return the best results when searching in CVs and jobs specifically.

To ensure that the ranking in Search! 3 is on par with or better than in Search! 2, we compared the search results scientifically. A panel of expert users assessed a large number of search results, judging each as relevant or irrelevant without knowing which version had produced it. The evaluation metrics then showed which version scored better on which queries.
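A minimal sketch of that kind of comparison, using precision@k over blind relevance judgments. The judgment data below is made up for illustration; the article does not disclose the actual metrics or figures used:

```python
# Compare two ranking versions via precision@k on blind judgments:
# each judgment is True (relevant) or False (irrelevant) for one
# result position, as assessed by the expert panel.
def precision_at_k(judgments, k):
    """Fraction of the first k results judged relevant."""
    top = judgments[:k]
    return sum(top) / len(top)

# Hypothetical blind judgments for one query, per version:
v2_judgments = [True, True, False, True, False]
v3_judgments = [True, True, True, False, True]

v2_score = precision_at_k(v2_judgments, 5)
v3_score = precision_at_k(v3_judgments, 5)
```

Aggregating such per-query scores over a large query set shows which version ranks better overall, and on which kinds of queries it falls short.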

Custom scoring plugin for Elasticsearch

In order to implement the desired scoring formula in Elasticsearch, we developed a custom Elasticsearch plugin. The plugin ships with the Search! product and is installed on every Elasticsearch node. An Elasticsearch plugin can modify every aspect of the underlying Lucene index, which in this case we use to implement the custom scoring formula designed specifically for Search!
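For comparison, Elasticsearch's built-in function_score query offers a lighter-weight way to override the default relevance score without writing a plugin. The sketch below shows the general shape of such a query; it is illustrative only and is not Textkernel's plugin, which runs as native code inside each node:

```python
# Illustrative: wrap a base query in function_score to combine the
# text-match score with a custom function, e.g. a Gaussian decay that
# boosts recently updated CVs. Field name and parameters are made up.
def with_custom_scoring(base_query, recency_field="last_updated"):
    return {
        "function_score": {
            "query": base_query,
            "functions": [
                {"gauss": {recency_field: {"origin": "now", "scale": "90d"}}}
            ],
            # multiply the decay factor into the relevance score
            "boost_mode": "multiply",
        }
    }

scored_query = with_custom_scoring({"match_all": {}})
```

A plugin remains the right choice when the formula needs data or logic that the query DSL cannot express, which was the case for the Search! ranking formula.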

Search! 3 and beyond

Elasticsearch has helped us reach our goal of scalability and high availability in Search! 3. Furthermore, the work on porting the ranking formula has even improved the overall quality of the search results, and custom Elasticsearch plugins allowed us to implement our specific scoring and ranking formula. We are very excited about the future possibilities of Search! We are already building new semantic search features on top of Elasticsearch, which we hope to release early this year. I am confident that Elasticsearch will scale with our customers' growth and stay with us for many Search! versions to come.

About the Author

Ruben Geerlings is the lead developer behind Search! and has worked at Textkernel for six years. He started working on Sourcebox and subsequently moved to the Search! and Match! products. Three years ago he initiated the Textkernel Innovation Week and he has helped organise the event every year since. His innovations include a “virtual agent for online applications” and a “crowdsourced job recommendation engine”.
Born and raised in Amsterdam, Ruben graduated in Computer Science at the University of Amsterdam specialising in Software Engineering. When he is not coding he spends most of his time trying to teach his one-year-old son to speak Dutch and Mandarin.

Curious about Textkernel? We are growing and hiring!