HR is a dynamic part of an organization tasked with recruiting, managing, and training a business’s most valuable asset: people. The multitude of tasks that land on HR desks often means they don’t have much time to develop or scale their processes. But with the urgent changes in business models related to digital transformation, remote work, and economic disruptions, inefficiency in the HR department will quickly spill over to negatively impact overall business operations. Unfilled roles, in particular, will drastically reduce overall business productivity, which, when left unchecked, will impact business revenue.
Talent acquisition and management have become challenging for businesses. Amid the volatile economy and changing preferences in work setups, the pressure is on the HR team to find and acquire the right people. One way to help is to harness the power of technology, particularly in the recruitment process, through artificial intelligence (AI).
What Is AI?
Artificial intelligence refers to the simulation of human intelligence by machines or systems. These systems can react to changes in their environment and adjust responses depending on the data received—without any human intervention.
AI has various roles in different industries. In the HR department, integrating machine-based intelligence can be the key to transforming core organizational processes, including recruiting the right people and retaining top talent.
How AI Helps in Talent Acquisition and Management
While AI will not replace HR managers and associates, it can support them in the work they do. Here are six ways AI can transform talent acquisition and management:
1. Screening Candidates
The HR team spends considerable time screening and filtering candidates. In the 2021 Recruiter Nation Report, “too many candidates” is one of the hiring challenges that staffing agencies describe more often than in-house recruiters do.
Screening candidates creates a bottleneck in the hiring process. Recruiters often invest heavily in the recruitment process—spending inordinate amounts of time reading and sorting resumes and conducting interviews—only to discover that the candidate is not suitable for the position they are trying to fill. Not only does this hamper the hiring process, but it’s also costly.
Hiring misalignment is rampant. According to Glassdoor, as many as 30% of employees resign from their posts within the first 90 days of employment.
An AI-powered talent acquisition system can filter candidates effectively and without bias. It can evaluate candidates based on skills and credentials and against the job requirements. By focusing on a shorter list of better applicants, HR teams can gain deeper insights and better identify eligible applicants. This increases the quality of new hires and reduces overhead expenses.
2. Sourcing the Right Talent
AI can upgrade the talent acquisition process by sourcing and engaging the right talent. An AI-based sourcing technology can reach out to candidates using data scraped from different data sources and match aspiring candidates to job openings with high accuracy. Using the wording requirements on the job description, AI can learn related industry-specific terminologies and expand the search.
Furthermore, AI can also use deep learning to match skills required for the job. For example, if an applicant claims to have expertise in a piece of software, the HR team can verify the claim by using the system to ask about the prerequisites of the skill and identify whether the candidate has the same or related skills.
3. Opening Doors for Hiring Global Talent
With AI, HR teams no longer have to limit their recruitment efforts to traditional in-office work. The ability to automate aspects of job posting, talent sourcing, and candidate review makes it possible to reach and accept more candidates, which makes it much easier to hire remote or hybrid workers. Accessing a global, diverse workforce brings many benefits to an organization, filling jobs being the least of them. Expanded creativity, a modern culture, and attractive working arrangements help to fill those empty roles with people who are up to the challenge.
HR teams can explore ways to streamline time tracking and payroll tasks through technology. For instance, an online time tracking tool can support remote team collaboration. Additionally, the HR team can also be more open and flexible in hiring hourly workers. Employees can clock in and out without a thought about payroll and accountability.
4. Freeing Up Time
Another valuable advantage of implementing AI in HR is automating administrative tasks, such as payroll processing, data management, and report generation. HR teams spend most of their time on administrative activities, which leaves them with little time to focus on improving employee performance, productivity, and engagement. These are areas that can impact organizational performance.
According to Gallup, only 34% of employees feel engaged at work, with 16% reporting being actively disengaged. Employee engagement matters because it can impact productivity, retention, and reputation. Organizations with high employee engagement rates are 21% more profitable.
5. Measuring Performance
Getting the right people is only part of the puzzle. HR teams must also ensure that the new hires perform well within the organization. As roles evolve, HR must measure employee performance regularly, including defining roles, assessing fair compensation, planning promotions, and ensuring business goals are met.
These processes can be done digitally through performance management software. For instance, an AI-powered system can collect employee data to help quantify an employee’s performance, show areas for improvement, and explore other opportunities employees can take advantage of. Such systems also facilitate a more objective employee assessment.
6. Candidate Support
AI is also critical for ensuring that an organization maintains positive employer branding. This means ensuring that candidates have a good experience, since in a highly competitive job market, top talent can decide whether or not to join an organization based on their experiences during the hiring process. Weeding out unqualified candidates will help HR focus on the best prospects to maintain timely communication and give them what they need to stay in the running.
AI can also be instrumental in providing more clarity about the application process, such as through chatbots. A complicated application process, lack of clarity around job roles, and unclear instructions can cause candidates to withdraw their applications and leave a poor rating. AI can give applicants real-time direction and guidance to complete their application process. It can also provide them with updates on the status of their application.
Technology in Talent Acquisition and Management
AI will revolutionize HR. Introducing AI in the recruitment process does not mean eliminating the “human” in human resources. AI complements the talent acquisition and management processes and adds value to the operation by supporting HR in being more efficient and effective. AI can serve as a tool to assist and optimize productivity within the organization, from hiring the right people to ensuring they remain.
Dean Mathews is the founder and CEO of OnTheClock, an employee time tracking app that helps over 15,000 companies all around the world track time.
Dean has over 20 years of experience designing and developing business apps. He views software development as a form of art. If the artist creates a masterpiece, many people’s lives are touched and changed for the better.
When he is not perfecting time tracking, Dean enjoys expanding his faith, spending time with family and friends, and finding ways to make the world just a little better.
As we approach 2023, labor shortages and resignations continue to impact the US economy and job market. According to the US Bureau of Labor Statistics (BLS), 10.7 million positions remained open in June 2022 while roughly 4.2 million employees resigned; job vacancies remain high as talent demand continues to outweigh supply (US Job Openings from Bloomberg, August 2022).
Recruiting and retaining talented employees is a constant priority among employers. But talent has quickly become among US employers’ scarcest and most valuable commodities. Companies are finding it more difficult to protect their existing talent pool, let alone fill roles that remain open. These trends are broadly putting business performance at risk.
How are US companies coping with labor shortages?
US companies have had to change course as labor shortages persist. Many companies have shifted some focus from recruiting new employees to retaining their existing talent as they find themselves competing for talent on all fronts.
Among other roles, the demand for talent recruiters is high as a result. There are currently more than 40,000 unique job openings for recruiters in the US, and the demand for recruiters made up 0.4% of the 10.7 million US jobs available in June.
The competitive nature of the labor market also has increased recruiters’ workloads, resulting in a more widespread desire to resign. A recent report shows that 77% of high-ranking recruiters are open to changing jobs (Recruiters Are Burned Out article from Bloomberg, July 2022).
Rising demand for compensation and benefits specialists
Companies now recognize that in the tight labor market, a more specialized approach is necessary. They are aligning their management strategies with labor trends to better retain employees as a result. This includes adjusting internal compensation, benefits, and mobility tactics to retain employees who might otherwise seek out better jobs.
In fact, the demand for compensation and benefits specialists in Q2 2022 rose to 23,500 jobs, an increase of 32% compared to Q2 2021, according to Textkernel’s Jobfeed tool (Figure 1). Demand for these roles has surpassed demand for other jobs since Q3 2021 as well.
A closer look at the demand for employees in these categories shows that the most in-demand jobs in Q2 2022 were Benefits Manager and Benefits Specialist roles (Figure 2). This reinforces the idea that companies have prioritized benefits as a way to attract and retain talent.
| Director of Benefits and Compensation | 919 |
The highest demand for these benefits roles appears to correlate with high competition in certain industries. The Professional, Scientific, and Technical Services field has the highest demand for compensation and benefits jobs (Figure 3), followed by the Finance and Insurance industry, which likewise has a high number of vacancies for new compensation and benefits staff.
These industries in particular have a highly skilled labor base that’s difficult to acquire under normal circumstances. This gives skilled workers an even stronger upper hand in demanding better compensation and benefits. Additional reports show that US companies will raise employee pay by an average of 4.1% over the next year (CFODive, August 2022).
US companies also are taking steps to become competitive by making work more flexible for employees. While 86% of US companies are hiring employees at the higher end of salary ranges to attract talent, 84% of companies also are increasing work location flexibility to help retain workers (WTW, August 2022).
Offering more generous benefits
Leading companies are demonstrating their resolve in these areas. PwC, one of the world’s largest accounting firms, is offering more generous benefits as a result of these challenges. The firm recently announced they will invest $2.4 billion to retain staff and compete amid a shortage of accountants and ongoing turnover. Personnel at PwC now can take 12 weeks of paid parental leave and choose from a more streamlined menu of benefits, among other options (NA Employers Rethinking Work and Reward Programs from Bloomberg Tax, May 2022).
In order to implement these new compensation and benefits policies, PwC has posted a number of job openings online, specifically for Compensation and Benefits specialist positions. In fact, this is part of a broader trend. As the accounting industry has been particularly hard hit by labor shortages, all of the Big 4 accounting firms (PwC, Deloitte, KPMG, and EY) have many openings for Compensation and Benefits specialist positions.
Making data-driven decisions
Organizations challenged with recruiting and retaining talent may benefit by taking action to address the labor trends discussed above. One effective way to accomplish this goal is to use data to better understand the labor market. This approach enables a firm to stay competitive and differentiate itself in the hiring market, retain employees, and attract new candidates.
Column CVs are visually appealing and are becoming widely used by candidates. We estimate that currently at least 15% of CV documents use a column layout. However, properly dealing with this layout is a surprisingly difficult computer vision problem. Since third party tools do not work well on CVs or are very slow, Textkernel already had a system in place to deal with column layout documents. We have greatly improved this system by applying various AI techniques. As a result, our handling of column CVs in PDF format has improved significantly, resulting in better extraction quality regardless of the document language.
The first step in an information extraction pipeline is to convert documents into raw text from which information can be extracted.
The system’s ability to perform well in this first step is crucial: any mistake will impact the performance of subsequent steps. Generating a well-rendered text representation for many different types of documents is a difficult problem to solve.
A simple method that renders the text in a top-down, left-to-right order is usually sufficient for documents that have a standard layout.
However, CVs come in various layouts, which are easy for humans to understand, but can be challenging to a machine.
A common layout we find in CV documents is the usage of columns. Column CVs are visually appealing and widely used by candidates applying for a job. Candidates want to neatly organize the information in their CV and provide visual structure, for example by having a sidebar that contains their contact information.
If a system were to use the basic left-to-right, top-down order rendering for this type of document, that would generate a rendering where the information from different sections of the CV is mixed together (see image aside).
Instead of reading the columns one after the other, the system would mix bits and pieces of each column together.
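To make the problem concrete, here is a minimal sketch in Python. The toy two-column layout, the coordinates, and the function names are illustrative assumptions, not Textkernel's actual implementation. It compares a naive top-down, left-to-right rendering with one that knows where the column separator is.

```python
# Each word is (x, y, text): x is the horizontal position, y is the line index.
# Hypothetical two-column CV: a sidebar on the left, main content on the right.
words = [
    (0, 0, "Contact"), (50, 0, "Experience"),
    (0, 1, "jane@example.com"), (50, 1, "Engineer at Acme"),
    (0, 2, "Amsterdam"), (50, 2, "2019-2022"),
]


def render_naive(words):
    """Top-down, left-to-right: sort by line first, then by horizontal position."""
    ordered = sorted(words, key=lambda w: (w[1], w[0]))
    return [text for _, _, text in ordered]


def render_by_column(words, separator_x):
    """Render everything left of the separator first, then everything right of it."""
    left = [w for w in words if w[0] < separator_x]
    right = [w for w in words if w[0] >= separator_x]
    return render_naive(left) + render_naive(right)


naive = render_naive(words)
by_column = render_by_column(words, separator_x=40)

print(naive)      # sidebar and main content are interleaved line by line
print(by_column)  # the sidebar is read in full before the main content
```

In the naive output, "Contact" is followed by "Experience" and the e-mail address is followed by a job title, which is exactly the kind of mixing that makes downstream extraction hard.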
An imperfect text rendering can still be useful for certain tasks: searching for keywords is still possible, and humans can still easily read the document.
But when automated systems try to extract structured information from an imperfect rendering, problems compound very quickly: finding the correct information becomes incredibly challenging.
At Textkernel, we strive to offer the best parsing quality on the market, which means that the widespread use of column based layouts demands our full attention. Keep reading to follow us on our journey to create a system that can understand creative document layouts and see how we were able to leverage machine learning to bring our Extract! product to the next level.
Our Previous Approach
Our system was already able to handle several types of document layouts, being able to identify sections of a document that should be rendered independently.
The approach has 3 steps. In the first step, the text content of the PDF is scanned and the visual gaps between text blocks are identified (see the example below). In the second step, a rule-based system decides whether each visual gap is a column separator or not. As the example below shows, not all visual gaps are column separators, and the left-to-right reading should not be interrupted for the gaps that are not. Based on these predictions, in the third step the text is rendered by separating all identified columns.
A naive approach that always renders the big visual gaps separately would have issues on several types of layouts. For example, in a key-value structured layout it would separate each key from its value in the text representation, leading to incorrect extraction of fields.
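The three steps can be sketched as follows. This is a simplified illustration under stated assumptions: the span format, the `MIN_GAP` threshold, and the single "key ends with a colon" rule are all hypothetical stand-ins for a much richer rule base.

```python
# A line is a list of (x_start, x_end, text) spans. All values are made up.
MIN_GAP = 10  # hypothetical width threshold for what counts as a visual gap


def find_gaps(line):
    """Step 1: indices i where the gap between span i and span i+1 is wide."""
    return [i for i in range(len(line) - 1)
            if line[i + 1][0] - line[i][1] >= MIN_GAP]


def is_column_separator(line, i):
    """Step 2: a toy rule. A key ending in ':' is followed by its value,
    so that gap must not interrupt left-to-right reading."""
    return not line[i][2].rstrip().endswith(":")


def render(lines):
    """Step 3: text before a column separator goes to the first column,
    the rest to the second; other gaps are read straight through."""
    first, second = [], []
    for line in lines:
        separators = [i for i in find_gaps(line) if is_column_separator(line, i)]
        split_at = separators[0] + 1 if separators else len(line)
        first.append(" ".join(span[2] for span in line[:split_at]))
        if split_at < len(line):
            second.append(" ".join(span[2] for span in line[split_at:]))
    return first + second


sidebar_cv = [
    [(0, 20, "Skills"), (40, 70, "Experience")],
    [(0, 20, "Python"), (40, 70, "Acme Corp")],
]
key_value_cv = [
    [(0, 20, "Name:"), (40, 70, "Jane Doe")],
    [(0, 20, "Phone:"), (40, 70, "555-0100")],
]

print(render(sidebar_cv))    # columns are rendered one after the other
print(render(key_value_cv))  # keys stay next to their values
```

Both toy documents contain gaps of the same width, so gap size alone cannot tell them apart; the decision in step 2 is what keeps "Name:" next to "Jane Doe".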
Our system achieved good rendering for many cases but was still failing to predict certain column separators. By design, the system was very precise when predicting that a visual gap is a column separator (i.e. precision of the positive class was very high). The rationale is that predicting a column separator when there is none (i.e. a false positive) is very costly: the rendered text will be wrong, which in turn hurts parsing quality. This high precision came at the cost of coverage (i.e. precision of the positive class was favored over recall of the positive class). The system was also very fast (tens of milliseconds), making it quite an efficient solution.
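The precision versus recall trade-off described above can be illustrated with a small calculation. The labels below are toy values chosen for the example, not measurements from the real system.

```python
def precision_recall(predicted, actual):
    """Precision and recall of the positive class ("gap is a column separator")."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# True means "this visual gap is a column separator". Toy data only.
actual = [True, True, True, True, False, False, False, False]
# A cautious system: it never raises a false alarm, but misses half the separators.
predicted = [True, True, False, False, False, False, False, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A system tuned this way never breaks a correctly working layout, but leaves half of the real column separators undetected.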
Improving such a system requires a model centric approach: we have to focus our efforts on changing the code. For example, increasing the coverage of supported cases is very difficult. When we encounter a new case, we need to implement a new rule for it, make sure it is compatible with the rest of the rule base, and choose how the rules should be applied and combined. Complexity grows quickly as we add more rules.
Ideally we would like our solution to be data centric, so we can improve its performance by collecting examples of how the system should perform, and focus our attention on curating and improving the example data. We would also like a solution that preserves our processing speed.
The first improvement trial
We analyzed several third party solutions that might help us improve our system, without going through all the difficulties of managing a rule-based system.
Most of these systems apply computer vision methods to extract text from an image representation of the document. These require computationally expensive algorithms and are therefore quite slow (i.e. seconds), and also difficult to manage for on-premise installations. We were also surprised to see that their performance was not much better than our previous rule-based approach. Therefore, we abandoned the third party track.
As we are focusing on improving our column handling, we don’t need to identify all the gaps in the text: only the larger vertical visual gaps are likely to correspond to columns. With these simplifying assumptions, we came up with a new method to detect the largest vertical visual gap from a histogram of the whitespace in the image representation of the document, as can be seen in the image below.
Looking at this representation, we can see a distinction between both types of layouts in terms of whitespace distribution, and we used this representation to train a neural network model for classifying between column layouts and regular layouts.
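The whitespace histogram idea can be sketched on a toy example. As an assumption made for brevity, the page is modelled here as a grid of characters rather than pixels of a rendered image, and the helper names are hypothetical.

```python
def make_row(left, right, split=13, width=28):
    """Build one fixed-width text row with content in two columns."""
    return left.ljust(split) + right.ljust(width - split)


# A made-up page; a real system would look at the pixels of the page image.
page = [
    make_row("Contact", "Experience"),
    make_row("jane@ex.com", "Engineer, Acme"),
    make_row("Amsterdam", "2019-2022"),
]


def whitespace_histogram(page):
    """For every horizontal position, count the rows that are blank there."""
    width = len(page[0])
    return [sum(row[x] == " " for row in page) for x in range(width)]


def largest_vertical_gap(page):
    """Return (start, end) of the widest run of positions that are blank on
    every row, ignoring runs that touch the page edges (margins)."""
    hist = whitespace_histogram(page)
    full = len(page)
    best, start = None, None
    for x, count in enumerate(hist + [0]):  # the sentinel closes a trailing run
        if count == full and start is None:
            start = x
        elif count != full and start is not None:
            is_margin = start == 0 or x == len(hist)
            if not is_margin and (best is None or x - start > best[1] - best[0]):
                best = (start, x)
            start = None
    return best


gap = largest_vertical_gap(page)
print(gap)  # start is inclusive, end is exclusive
```

Note that the space inside "Engineer, Acme" does not count as a vertical gap, because other rows have text at that position.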
Note that this method does not fit all our requirements: we still don’t have the coordinates needed to separate the column content. In addition, we also noticed the processing speed will be an issue if we continue on this track.
Given the effort still needed to get this method to a usable state, we took a step back and went back to the drawing board.
Our New Approach
We already stated that in our ideal scenario we would be able to improve our system by feeding it good quality data. How can we move from our model centric approach into a data centric approach?
At the core of our solution we have a single type of decision: deciding if a visual gap is separating related or unrelated content (e.g. a column separator). This is a binary classification problem, for which we can train a machine learning model to replicate the decision.
By making use of our rule-based system, we can generate our training data by converting our rules into features and using the system’s output decision as the label we want our new model to learn. By doing this we can begin to focus on improving the collection and curation of more training data, and easily retrain the model every time we want to improve it, instead of adding more rules to our code base.
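A minimal sketch of this bootstrapping idea is shown below. The three rules, the gap fields, and the way the rules are combined are hypothetical; they only illustrate how intermediate rule results can become features while the final rule-based decision becomes the (pseudo) label.

```python
# Hypothetical stand-ins for checks that a rule base might compute.
def rule_gap_is_wide(gap):
    return gap["width"] >= 10


def rule_gap_is_tall(gap):
    return gap["height_lines"] >= 5


def rule_left_is_key(gap):
    return gap["left_text"].endswith(":")


RULES = [rule_gap_is_wide, rule_gap_is_tall, rule_left_is_key]


def rule_based_decision(gap):
    """The existing system's verdict: is this gap a column separator?"""
    return (rule_gap_is_wide(gap) and rule_gap_is_tall(gap)
            and not rule_left_is_key(gap))


def to_training_example(gap):
    """Rule outcomes (plus a raw measurement) become features;
    the rule-based decision becomes the pseudo label."""
    features = [int(rule(gap)) for rule in RULES] + [gap["width"]]
    label = int(rule_based_decision(gap))
    return features, label


gaps = [
    {"width": 20, "height_lines": 30, "left_text": "Skills"},
    {"width": 15, "height_lines": 2, "left_text": "Name:"},
]
dataset = [to_training_example(gap) for gap in gaps]
print(dataset)
```

Once the data is in this form, adding a new feature or correcting a label no longer requires touching the decision logic itself.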
We have a new approach and we need to validate it. For that we follow the model development pipeline:
We start with selecting the data for training our machine learning model. Unlike a rule-based system that needs a few hundred examples to develop and test the rules, we will need several thousand examples to train our model.
We started with problematic documents that our customers kindly shared with us in their feedback. However, this set was quite small (about 200 documents). How can we find thousands more column CVs when they only account for about 10-15% of documents? Luckily, from our initial attempt we have a neural network based column classifier. Although not sufficient for replacing our old rule-based system, it’s a great method to mine documents with a column layout. Even if this classifier is not 100% accurate, it is still better than randomly selecting documents (of which only 10-15% would be column CVs). In addition, we also collect a random sample of documents to make sure our method works well across all layouts (i.e. ensure we do not break rendering of correctly working document layouts).
Generation of the Dataset
To generate our dataset we process our document sets through our existing rendering pipeline. For each visual gap, the target label is initially set to the decision made by our rule-based system. We bootstrapped the features by using the variables and rules computed in this decision. In addition, we added several new features that better quantify some of the properties of column layouts.
In the previous step we generated a pseudo-labeled dataset: the labels originate from our existing system and are not verified by a human. To ensure that our machine learning model will not simply learn to reproduce the mistakes of the rule-based system, we also manually annotated a small sample of column CVs. Since this is a time consuming task, having potential column CVs as identified by our neural network based column classifier helped to speed up our annotation process.
We can now train a machine learning model to mimic the decisions of our rule-based system. We started our experiments with the decision tree algorithm. This is a simple algorithm to apply to our dataset and very effective: it offers good classification performance while being very fast to apply, a key characteristic we wanted in our approach.
However, decision trees have several problems: they are prone to overfitting and suffer from bias and variance errors. This results in unreliable predictions on new data. This can be improved by combining several decision tree models. Combining the models will result in better prediction performance in previously unseen data.
There are several ways to achieve this, the most popular being bagging and boosting. In bagging, several models are trained in parallel on subsets of the data; the random forest is an example of this method. In boosting, models are trained sequentially, each one trained to correct the mistakes of the previous one; the gradient boosting algorithm is an example of this method.
After testing a few options we settled on the boosting approach using a gradient boosting method.
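To illustrate the boosting idea, here is a deliberately small sketch of gradient boosting with squared-error loss, where each weak learner is a one-split decision tree (a stump) fitted to the residuals of the ensemble so far. It is a teaching example under several assumptions: the two binary features, the toy data, the number of rounds, and the learning rate are all made up, and a production system would more likely rely on an established library and a classification loss.

```python
def fit_stump(X, residuals):
    """Pick the one-feature split that best fits the residuals (squared error)."""
    best = None
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[f] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, f, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("no feature has two distinct values")
    return best[1:]


def stump_predict(stump, row):
    f, threshold, left_mean, right_mean = stump
    return left_mean if row[f] <= threshold else right_mean


def fit_boosting(X, y, rounds=20, learning_rate=0.5):
    """Each round fits a stump to what the current ensemble still gets wrong."""
    base = sum(y) / len(y)
    stumps, scores = [], [base] * len(y)
    for _ in range(rounds):
        residuals = [target - score for target, score in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [score + learning_rate * stump_predict(stump, row)
                  for score, row in zip(scores, X)]
    return base, stumps, learning_rate


def predict(model, row):
    base, stumps, learning_rate = model
    score = base + learning_rate * sum(stump_predict(s, row) for s in stumps)
    return int(score >= 0.5)


# Toy features: [gap_is_wide, left_text_is_key]; label 1 = column separator.
X = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1], [0, 0], [0, 0], [0, 1]]
y = [1, 1, 1, 0, 0, 0, 0, 0]
model = fit_boosting(X, y)
print([predict(model, row) for row in [[1, 0], [1, 1], [0, 0], [0, 1]]])
```

No single stump can separate this toy data, because a gap is a separator only when it is wide and the text on its left is not a key. The sequence of stumps, each correcting the previous ones, gets there.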
Efficient Label Correction
Our new model was mostly trained to reproduce the decisions of our rule-based system because most of its training data comes from pseudo-labeled examples. The limited number of human annotations also makes it difficult to do error analysis and identify the cases in which the new model is misbehaving.
Even so, the added small sample of manually annotated data for column CV documents can already shift the decision in informative ways. As a result, the discrepancy between the predictions of the new method and the rule-based system can be analyzed manually and corrected. We call this approach delta annotation. This is an effective process of labeling only the data that will push the model into performing better.
At Textkernel we are always looking for ways to deliver the best quality parsing. Having quality data is essential for what we do, so of course, we already have implemented great solutions for this using tools such as Prodigy to facilitate rapid iteration over our data.
With this partially corrected dataset, we can retrain our model, and we can keep iterating and improving our dataset by doing delta annotation between the latest model and the older ones. In our case, two iterations were enough to saturate the differences and reach good performance at the visual gap level.
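The selection step of delta annotation can be sketched in a few lines. The two decision functions and the annotator answers below are hypothetical placeholders; the point is only that human effort goes to the examples where the old and new systems disagree.

```python
# Toy visual gaps; the fields and values are made up for illustration.
examples = [
    {"id": 1, "width": 25, "left_is_key": False},
    {"id": 2, "width": 12, "left_is_key": False},
    {"id": 3, "width": 22, "left_is_key": True},
    {"id": 4, "width": 3, "left_is_key": False},
]


def old_rules(gap):
    """Stand-in for the rule-based decision."""
    return gap["width"] >= 20


def new_model(gap):
    """Stand-in for the trained model's decision."""
    return gap["width"] >= 10 and not gap["left_is_key"]


def delta_examples(examples, old_decision, new_decision):
    """Keep only the examples on which the two systems disagree."""
    return [ex for ex in examples if old_decision(ex) != new_decision(ex)]


to_review = delta_examples(examples, old_rules, new_model)
review_ids = [ex["id"] for ex in to_review]

# Start from pseudo labels, then overwrite the reviewed ones with human answers.
labels = {ex["id"]: old_rules(ex) for ex in examples}
human_answers = {2: True, 3: False}  # hypothetical annotator decisions
labels.update(human_answers)

print(review_ids)
print(labels)
```

Only two of the four gaps need a human look here, and after the update the dataset can be used to retrain the model for the next iteration.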
This enables us to follow a data centric approach: we can focus on systematically improving our data in order to improve the performance of our model.
We have a new approach that is more flexible than before, but we still have a big challenge. How can we be sure that better decisions at the visual gap level translate into an overall improvement in rendering at the document level (recall that a document can have multiple visual gaps)? Even more important, does this translate into extraction quality improvements? If we want to be confident in our solution, we need to evaluate our system at multiple levels.
Firstly, we did a model evaluation to know if we are better at making decisions at the visual gap level. For this, we can simply use our blind test set and compare the performance of our new model with the old model. On more than 600 visual gaps, our new model makes the right decision in 91% of the cases as opposed to only 82% for our old rule-based system. However, visual gaps are not all equally important and some matter more than others: in our case, the visual gaps corresponding to columns are the most important to get right. For this important subset, we see a performance increase from 60% to 82%. In other words, we have cut the errors we used to make by more than half!
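The "more than half" statement follows directly from the reported accuracies, as this short calculation shows. Only the function name is an invention; the percentages are the ones quoted above.

```python
def relative_error_reduction(old_accuracy, new_accuracy):
    """Share of the old system's errors that the new system no longer makes."""
    old_error = 1 - old_accuracy
    new_error = 1 - new_accuracy
    return (old_error - new_error) / old_error


all_gaps = relative_error_reduction(0.82, 0.91)     # error rate 18% -> 9%
column_gaps = relative_error_reduction(0.60, 0.82)  # error rate 40% -> 18%

print(f"all visual gaps: {all_gaps:.0%} fewer errors")
print(f"column gaps:     {column_gaps:.0%} fewer errors")
```

For the column subset the error rate drops from 40% to 18%, which is a 55% relative reduction; across all visual gaps the errors are halved.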
Secondly, we looked to see if the improvement in visual gap classification translates into better rendering (recall that in a document there might be multiple visual gaps). In other words, are we doing a better job of not mixing sections in column CVs? However, since multiple renderings can be correct, it is hard to annotate a single “correct” rendering (which would have allowed us to automatically compute rendering performance). Therefore, we had to do a subjective evaluation of the rendering. Using our trustworthy Prodigy tool, we displayed side-by-side the renderings of the new and the old system to our annotators (without them knowing which side is which). The annotators evaluated if the text is now better separated, worse, or roughly the same as before. The results on a set of about 700 CVs are really good: well rendered CVs increased from 62% to 90%.
Finally, we looked to see if better rendering translates into better parsing. We knew that in column CVs where the old system was failing, our parser would sometimes extract less information, in particular contact information like name, phone numbers, and address. Thus, the least labor intensive way is to simply check if the fill rates are increasing. On more than 12,000 random CVs, we see that the contact information fill rates are increasing by 4% to 10% absolute. But more does not necessarily mean better! Thus, we also invested in evaluating more than 1,000 differences between our parser using the old system and our parser using the new system. The results in the figure below show the percentage of errors our new system has fixed. This is our final confirmation that we now have in our hands a better parser! Great job team!
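As a rough illustration of the fill rate check, the sketch below reads "fill rate" as the share of documents for which a field was extracted at all. That reading, the field name, and the toy parser outputs are assumptions made for the example, and the numbers are not real results.

```python
def fill_rate(parsed_docs, field):
    """Share of documents in which the parser produced a non-empty value."""
    filled = sum(1 for doc in parsed_docs if doc.get(field))
    return filled / len(parsed_docs)


# Toy outputs for the same four CVs, parsed with the old and the new rendering.
old_outputs = [
    {"phone": "555-0100"},
    {"phone": ""},
    {"phone": "555-0102"},
    {},
]
new_outputs = [
    {"phone": "555-0100"},
    {"phone": "555-0101"},
    {"phone": "555-0102"},
    {},
]

old_rate = fill_rate(old_outputs, "phone")
new_rate = fill_rate(new_outputs, "phone")
absolute_gain = new_rate - old_rate
print(f"{old_rate:.0%} -> {new_rate:.0%} ({absolute_gain:+.0%} absolute)")
```

A higher fill rate only shows that more values were extracted, not that they are correct, which is why a manual review of the differences is still needed.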
To summarize our improvements:
- Correct decisions at the visual gap level improved from 60% to 82% for visual gaps corresponding to columns.
- Rendering quality improved from 62% to 90%.
- Contact information fill rates increased by 4% to 10% absolute.
- Errors in contact information were reduced by 33% to 100%.
- Speed impact is negligible compared to our rule-based system (10ms extra).
Our extraction quality on column CVs is now better than ever. By leveraging machine learning to replace our rule-based system we can now correctly parse an even wider range of CV layouts.
Our main takeaways from this project are:
- It is important to choose the right approach. For certain problems, more complex approaches or ML models require a lot of time investment to get right and still have speed issues. Experimenting with several approaches, even if they are abandoned, still brings value: these systems can be complementary in parts of the pipeline (e.g. for efficient data selection).
- With the right data and ML methods, a rule-based system can be bootstrapped into an ML system with significantly better generalization capabilities. Further improvements to the system can be made by improving the training data instead of the complex task of managing the rules.
- It is important to look at the global picture, especially for systems with downstream tasks. Local improvements need to be evaluated globally to validate their effectiveness.
Don’t miss out on the great candidates that make use of these layouts!
About The Author
Ricardo Quintas has been working for Textkernel for 4 years as the Tech Lead Machine Learning.
Annually for the past 9 years, Textkernel has dedicated a week to innovation and turned into an incubator for internal mini startups. This year, Innovation Week was bigger and better: 10 innovative projects were pitched and approved, with almost 100 team members participating in a week full of ideation, cooperation, and great entrepreneurial spirit. Great minds connecting!
The concept of Innovation Week is simple: anyone across the company can pitch an idea during a pitch session. The company members then have several days to vote for their preferred ideas and to indicate their availability to participate. Finally, the team captains select the members of their team and then set out to create a working proof of concept to present against all the other ideas.
It’s about diversity and disruption
“Innovation Week is about disruption,” says Mihai Rotaru, Head of R&D and co-organizer of the event. “It’s about looking at customer problems from a different point of view, it’s about exploring new technologies, it’s about going out of your comfort zone.” And he adds, “The mix of colleagues from different departments is the core of success in Innovation Week.” This is how teams bring together all the necessary skills and enough variety in perspectives for innovations.
Facilitating a bottom-up culture
Innovation Week has been held every year since 2013 (with a pandemic break in 2020) and it is firmly embedded in Textkernel’s corporate culture. “Innovation Week is important because it brings people together and it allows innovation bottom-up,” says Textkernel CEO Gerard Mulder. And that is felt throughout the entire company. “Knowing that the company cares about our ideas truly has a great impact,” says Hope Natell, Sales Manager North America.
The grand finale
It is at the end of the week that all ideas are shared with the company, in the form of a product pitch in front of all colleagues. This year, the jury and the Textkernel employees chose Umut Can Ozyar’s project “Journey” as the winner.
Umut describes his team’s winning product as an “AI assistant for more efficiency and a personalized candidate engagement throughout the recruitment process”. His team’s goal was to improve a recruiter’s life with the help of AI-based technology. “Our product can suggest targeted emails, messages, and interview questions unique for each job and candidate, accentuating their best qualities,” explains Umut. And who knows, maybe it will make its way onto the Textkernel roadmap soon.
After the presentations, Textkernel colleagues from offices around the world – the Netherlands, France, Germany, the United Kingdom and the United States – celebrated the closing of the event. After all, it is not only about the projects but also about a lot of fun and team spirit within the company. Or, as Textkernel CEO Gerard Mulder puts it: “It’s the best event of the year.”