Textkernel has been developing the use of nested fields in its semantic search product. In this blog post Ruben Geerlings, Head of Architecture at Textkernel, explains how this can be used for more complex search queries and better search results.
By Ruben Geerlings
Nested field search (field-in-field search) is a new feature that can be added to our semantic search engine. It adds the object data type to the data model. Objects can themselves have any number of fields (the nested fields). This might not look like a big change on the surface, but it adds the power of relational database queries (think SQL JOIN) to our search engine, which opens up a lot of possibilities.
Faceted search has been at the core of our Search! product since the beginning. When you are searching for candidates or jobs, the facets show how your database is distributed along different axes. Suppose you are searching in your database for a suitable candidate. You can type in a few deliberate keywords, such as a job title or a skill, but you might still get thousands of results back. How can you refine your search further?
Facets are very helpful to get a quick overview of what your search results consist of. You can view all possible values for specific characteristics such as years of experience, preferred industry, availability, and education level. In the education level facet for example, you can see that you have 202 candidates with a bachelor’s degree, 116 candidates with a master’s and only 1 with a PhD. Facets make it possible to choose and drill down on specific criteria that are relevant to your search.
What makes faceted search particularly useful is that you are able to quickly switch between entering keywords and selecting or unselecting facets, refining your search results interactively.
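To make the idea concrete, here is a minimal sketch of how facet counts and drill-down could work on flat fields. It is not how the Search! product is implemented; the records, field names and helper functions are all made up for illustration.

```python
from collections import Counter

# Hypothetical search results; the field names are illustrative only.
results = [
    {"name": "Candidate 1", "education_level": "bachelor", "availability": "immediately"},
    {"name": "Candidate 2", "education_level": "master", "availability": "immediately"},
    {"name": "Candidate 3", "education_level": "bachelor", "availability": "within 3 months"},
    {"name": "Candidate 4", "education_level": "phd", "availability": "within 3 months"},
]


def facet_counts(docs, field):
    """Count how many documents carry each value of one flat field."""
    return Counter(doc[field] for doc in docs if field in doc)


def select_facet(docs, field, value):
    """Drill down: keep only the documents with the selected facet value."""
    return [doc for doc in docs if doc.get(field) == value]


education_facet = facet_counts(results, "education_level")
bachelors = select_facet(results, "education_level", "bachelor")
# Facets are recomputed on the refined result set after every selection.
availability_facet = facet_counts(bachelors, "availability")
```

Each facet is computed on a single field in isolation, which is exactly the property the next section runs into.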
Beyond faceted search
Faceted search also has its limitations. Facets are independent of one another. In the underlying data structure we implement that by calculating facets on a field and assigning one value per field per candidate. You might have one facet for years of experience and another one for preferred industry. But how can you search for candidates with more than 5 years of experience in a particular industry?
Even when we allow multiple values in the industry field – one entry for every industry the candidate has worked in – we are still not able to query for a specific number of years in the industry because that relationship simply does not exist in the data model. All relationships disappear after being ‘flattened’ to fit in the data model.
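The loss of relationships can be shown with a small sketch. The candidate record, the field names and the `flatten` and `flat_match` helpers below are hypothetical and only serve to illustrate the problem.

```python
# Hypothetical candidate: 8 years in finance, 1 year in healthcare.
candidate = {
    "name": "Example candidate",
    "experience": [
        {"industry": "finance", "years": 8},
        {"industry": "healthcare", "years": 1},
    ],
}


def flatten(doc):
    """Flatten the experience objects into independent multi-valued fields."""
    return {
        "name": doc["name"],
        "industry": [e["industry"] for e in doc["experience"]],
        "years": [e["years"] for e in doc["experience"]],
    }


def flat_match(flat_doc, industry, min_years):
    """Best effort on flat data: each condition is checked on its own field."""
    return industry in flat_doc["industry"] and any(
        years > min_years for years in flat_doc["years"]
    )


flat = flatten(candidate)
# The flat document no longer records which years belong to which industry,
# so "more than 5 years in healthcare" matches because of the finance years.
false_positive = flat_match(flat, "healthcare", 5)
```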
This is where our nested fields feature comes in. The solution is to define object fields that preserve these relationships. For example, we can define an industry field with two nested fields: name and years of experience. We can then query for a particular combination of industry name and years of experience. In the same way we can define object fields for all the various details that we parse out of a CV: job experience, education, skills, etc. In a search engine without nested fields we can still query all of these details, but the correlation between them is lost.
The screenshot above shows how the user interacts with nested fields: after typing a keyword such as ‘developer’ a breadcrumb appears in the query box. When the user clicks on it a number of additional nested fields are shown. In this example: years of experience, and recency, which can be used to refine the ‘developer’ query further.
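As a rough sketch of the industry object field described above, the following models it with plain Python dataclasses. The class names, attribute names and the `matches` helper are assumptions made for this example and do not reflect our actual data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Industry:
    """One object value with two nested fields."""
    name: str
    years_of_experience: int


@dataclass
class Candidate:
    name: str
    industries: List[Industry] = field(default_factory=list)


def matches(candidate, industry_name, min_years):
    """Both conditions must hold on the SAME nested object."""
    return any(
        ind.name == industry_name and ind.years_of_experience > min_years
        for ind in candidate.industries
    )


# Hypothetical candidates.
candidates = [
    Candidate("Candidate 1", [Industry("finance", 8), Industry("healthcare", 1)]),
    Candidate("Candidate 2", [Industry("healthcare", 6)]),
]

hits = [c.name for c in candidates if matches(c, "healthcare", 5)]
```

Because the name and the years stay together in one object, Candidate 1 is no longer matched on the strength of unrelated finance experience.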
Nested field queries
With nested fields we can create queries that express relationships between different fields. This is traditionally only possible in relational databases. In SQL databases, for example, you can model a candidate as an entity with a one-to-many relationship to another entity, work experience, which can itself have many attributes. To find candidates with a certain combination of work experience attributes you would then construct an SQL query that uses a JOIN operator on the candidate and work-experience tables.
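For readers who want to see the relational version spelled out, here is a small sketch using Python's built-in `sqlite3` module. The table layout, column names and sample rows are invented for the example.

```python
import sqlite3

# In-memory database with a one-to-many candidate -> work_experience model.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE candidate (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE work_experience (
        candidate_id INTEGER REFERENCES candidate(id),
        industry TEXT,
        years INTEGER
    );
    INSERT INTO candidate VALUES (1, 'Alice');
    INSERT INTO candidate VALUES (2, 'Bob');
    INSERT INTO work_experience VALUES (1, 'finance', 8);
    INSERT INTO work_experience VALUES (1, 'healthcare', 1);
    INSERT INTO work_experience VALUES (2, 'healthcare', 6);
    """
)

# The JOIN keeps industry and years tied to the same work_experience row.
rows = conn.execute(
    """
    SELECT DISTINCT c.name
    FROM candidate AS c
    JOIN work_experience AS w ON w.candidate_id = c.id
    WHERE w.industry = ? AND w.years > ?
    ORDER BY c.name
    """,
    ("healthcare", 5),
).fetchall()

matching_names = [name for (name,) in rows]
conn.close()
```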
Similar queries are possible in our Search! product. The difference with SQL is that all results are also scored on relevance as they pass through the search engine’s custom scoring formula. This means that the most relevant candidates are shown at the top of the results.
Examples of searches that are possible with nested fields:
- “Candidate must have expert knowledge of Python.”
- “Candidate should be fluent in Mandarin or Cantonese.”
- “Candidate must be available for project work between May and June.”
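The three example searches above can be sketched in plain Python. This is only an illustration: the records, the field names, the year 2016 and the toy scoring are assumptions, and reading "must" as a hard filter and "should" as a score boost is our interpretation for this example, not the engine's actual scoring formula.

```python
from datetime import date

# Hypothetical candidates with three object fields.
candidates = [
    {"name": "Alice",
     "skills": [{"name": "Python", "level": "expert"},
                {"name": "Java", "level": "beginner"}],
     "languages": [{"name": "Mandarin", "proficiency": "fluent"}],
     "availability": [{"start": date(2016, 4, 1), "end": date(2016, 8, 31)}]},
    {"name": "Bob",
     "skills": [{"name": "Python", "level": "beginner"},
                {"name": "Java", "level": "expert"}],
     "languages": [{"name": "Cantonese", "proficiency": "fluent"}],
     "availability": [{"start": date(2016, 5, 1), "end": date(2016, 6, 30)}]},
    {"name": "Carol",
     "skills": [{"name": "Python", "level": "expert"}],
     "languages": [{"name": "English", "proficiency": "fluent"}],
     "availability": [{"start": date(2016, 1, 1), "end": date(2016, 12, 31)}]},
]


def has_nested(doc, field, **conditions):
    """True if a single object in doc[field] satisfies every condition."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in doc.get(field, [])
    )


def available_between(doc, start, end):
    """True if one availability period covers the whole requested window."""
    return any(
        period["start"] <= start and period["end"] >= end
        for period in doc.get("availability", [])
    )


def search(docs):
    results = []
    for doc in docs:
        # "must" conditions act as filters.
        if not has_nested(doc, "skills", name="Python", level="expert"):
            continue
        if not available_between(doc, date(2016, 5, 1), date(2016, 6, 30)):
            continue
        # "should" condition only raises the (toy) relevance score.
        score = 1.0
        if any(has_nested(doc, "languages", name=lang, proficiency="fluent")
               for lang in ("Mandarin", "Cantonese")):
            score += 1.0
        results.append((score, doc["name"]))
    return sorted(results, reverse=True)


ranked = search(candidates)
```

Note that Bob is filtered out even though his record contains both "Python" and "expert": the two values belong to different skill objects.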
Developing the nested fields feature was challenging. At the start we tried to keep the scope small but pretty soon we realised that if we wanted to get all the benefits from this new feature we had to put it at the core of our engine. Half a year later we had ported all existing features to support the new data structure and updated half of our code base in the process. We were excited to finally deploy nested fields into our semantic search product. From a technical perspective it could just as well have been a major version upgrade!
As described in a previous blog post, Textkernel has switched to using Elasticsearch as the underlying search engine. Fortunately Elasticsearch has built-in support for nested queries. We did have to make a decision on how to map our data model into the Elasticsearch index structure. Elasticsearch gives two options for implementing nested fields:
- Using parent-child relationships between documents.
- Using nested objects.
Option 1 (parent-child relationships) uses Lucene’s query-time joins under the hood. This requires indexing the object fields as separate documents and creating explicit parent/child references between them. The downside of this approach is that queries take longer to execute because the join between parent and child documents has to be computed at query time.
Option 2 (nested objects) uses Lucene’s index-time joins to store the object fields together with the parent document as a document block. The parent-child relationship is implicit in the index structure. This gives better query performance because the join is already resolved at index time and the nested objects sit right next to their parent document.
We chose to implement option 2 because, to guarantee the best user experience, search response times are more important than indexing performance.
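To give an impression of what the two options look like in practice, here is a sketch of the index mappings and queries as Python dictionaries. It follows the syntax of current Elasticsearch documentation (`join` field, `has_child` and `nested` queries), which may differ from the Elasticsearch version we built on. The field names `relation`, `experience`, `industry` and `years` are hypothetical and are not our actual index structure.

```python
import json


def experience_conditions(prefix, industry, min_years):
    """Conditions that must hold together on one experience object."""
    return {
        "bool": {
            "must": [
                {"term": {prefix + "industry": industry}},
                {"range": {prefix + "years": {"gt": min_years}}},
            ]
        }
    }


# Option 1: parent-child. Experience entries are separate child documents.
parent_child_mapping = {
    "mappings": {
        "properties": {
            "relation": {"type": "join", "relations": {"candidate": "experience"}},
            "industry": {"type": "keyword"},
            "years": {"type": "integer"},
        }
    }
}
parent_child_query = {
    "query": {
        "has_child": {
            "type": "experience",
            "query": experience_conditions("", "healthcare", 5),
        }
    }
}

# Option 2: nested objects. Experience entries live inside the candidate.
nested_mapping = {
    "mappings": {
        "properties": {
            "experience": {
                "type": "nested",
                "properties": {
                    "industry": {"type": "keyword"},
                    "years": {"type": "integer"},
                },
            }
        }
    }
}
nested_query = {
    "query": {
        "nested": {
            "path": "experience",
            "query": experience_conditions("experience.", "healthcare", 5),
        }
    }
}

# The dictionaries serialise to the JSON bodies Elasticsearch expects.
nested_query_json = json.dumps(nested_query, indent=2)
```

Both queries express the same question. The difference lies in where the join work happens: at query time for `has_child`, at index time for `nested`.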
The nested search feature proved to be quite a bit bigger in scope than expected. We had to upgrade the data model and query language, which in turn affected all existing features in some way. We do think that this is a good foundation on which to build new query features in the future. We are excited to see what use cases for nested fields our customers come up with and how this will change the way in which our Search! product will be used.
I would challenge anyone using our semantic search product to come up with a complex search request that they cannot currently express as a query. There’s a good chance that there is a solution and that it makes use of the new nested fields feature.
About the author
Ruben Geerlings is Head of Architecture at Textkernel. He has worked at Textkernel for over seven years. He started working on Sourcebox and subsequently moved to development of the Search! and Match! products. In his current role he is responsible for the scalability of Textkernel’s SaaS platform. He is also one of the organisers behind Textkernel’s Innovation Week. His innovations include a “virtual agent for online applications” and a “crowdsourced job recommendation engine”.