Machine Learning on CVs

Published on 2017-04-08 | Jérémy Grèze

How can we make candidates’ data valuable? This is the challenge Mohamed gave me for the La Claque meetup: to present my work as a data analyst on HR and recruitment data to recruiters.

To make this concrete, he sent me 368 applicant CVs to analyse, along with a short message: “have fun, see you on April 5th”. 😉

Overview of 368 CVs

I started by converting the files (mainly PDF and DOCX) to text using the Apache Tika library (explanations here). Then I loaded the resulting set into Dataiku DSS.

The first step in a data science project is to analyse and understand the data. I went through the CVs and quickly understood that the candidates were applying for a commercial (sales) position, even though a few candidates were somewhat off-topic. Also, about 10% of candidates applied via Kudoz, an application, and their CVs followed a very standardised template.

I then started building the dataset. Each row is a candidate and each column is a variable used in the analysis. This layout first lets me explore the data conveniently, and then lets me apply machine learning algorithms to these variables.

I was able to quickly build the following variables:

I also kept all the words from each CV in order to apply text mining transformations: removal of stop words, reduction of each word to its root (stemming), and counting word frequencies per CV.
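As a rough illustration, the three transformations can be sketched in a few lines of Python. This is not the Dataiku DSS recipe used for the talk: the stop-word list is a tiny made-up sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer (Porter or Snowball, for example), and the sample sentence is invented.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real project would use a full list).
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "for", "with", "i", "my"}

# Crude suffix stripping, standing in for a real stemmer such as Porter or Snowball.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_frequencies(cv_text):
    """Lowercase, tokenise, drop stop words, stem, then count terms for one CV."""
    tokens = re.findall(r"[a-zà-ÿ]+", cv_text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(stems)


# Invented sample text, not taken from the 368 CVs.
sample_cv = "Managed sales teams and managing key accounts; sales increased."
freqs = word_frequencies(sample_cv)
print(freqs.most_common(3))
```

Thanks to the stemming step, “managed” and “managing” are counted as the same term, which is the point of reducing words to their root before counting.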

Overview of data preparation
Data preparation in Dataiku DSS.

Admittedly, I could have built more variables, and good suggestions were made during the meetup. One could try to extract dates to estimate the number of years of education or work experience, or work on the descriptions of past experience. Commercial candidates often state their revenue and turnover figures, and we could try to extract this information.

Some results of the analysis were presented (see slides). For example, most CVs contain between 200 and 400 words, and about 40% of candidates appear to hold a master’s degree.
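The date idea could start from something as simple as the sketch below, which collects four-digit years and measures the span between the earliest and the latest. It is only a heuristic under stated assumptions: the `year_span` helper and the sample text are hypothetical, the pattern only accepts years from 1950 to 2029, and it cannot tell education from work experience or a year from another four-digit number.

```python
import re

# Four-digit years between 1950 and 2029, not embedded in a longer word or number.
YEAR_PATTERN = re.compile(r"\b(19[5-9]\d|20[0-2]\d)\b")


def year_span(cv_text):
    """Return (first_year, last_year, span) for the years found, or None if there are none."""
    years = [int(y) for y in YEAR_PATTERN.findall(cv_text)]
    if not years:
        return None
    return min(years), max(years), max(years) - min(years)


# Invented sample text, not taken from the 368 CVs.
sample = "2009-2012 Business school. 2012-2016 Account manager at a software vendor."
print(year_span(sample))
```

A more careful version would parse date ranges section by section so that education and work experience are measured separately.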

Map of applicants
Distribution of applicants in France according to the extracted postal code.
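A map like this one relies on pulling a postal code out of free text. A minimal sketch, assuming the first standalone five-digit number in a CV is the postal code: French postal codes have five digits and the first two generally identify the department. The `extract_department` helper and the addresses are made up for illustration, and Corsica and the overseas departments would need special handling.

```python
import re
from collections import Counter

# Five consecutive digits that are not part of a longer number.
POSTAL_CODE = re.compile(r"(?<!\d)(\d{5})(?!\d)")


def extract_department(cv_text):
    """Return the two-digit prefix of the first postal code found, or None."""
    match = POSTAL_CODE.search(cv_text)
    if match is None:
        return None
    return match.group(1)[:2]


# Invented sample snippets, not taken from the 368 CVs.
cvs = [
    "12 rue de la Paix, 75002 Paris",
    "3 place Bellecour, 69002 Lyon",
    "Appartement 4, 75011 Paris",
    "No address given",
]
per_department = Counter(extract_department(cv) for cv in cvs)
print(per_department)
```

Counting candidates per department in this way gives the figures needed to shade a map.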

After the first analyses and the generation of the variables, I moved on to applying machine learning algorithms to the data. The goal was to find correlations and patterns that can’t be seen with the naked eye.

By applying a clustering model (unsupervised machine learning), I grouped candidates according to their similarity. The advantage over a classical segmentation is that one does not define the discriminating criteria in advance; the algorithm works out which ones are relevant. The disadvantage is that the results are sometimes difficult to interpret.

One result identified 3 groups, plus 1 group of outliers: a group nicknamed “Parisian Bobo” during the meetup (a name given to the fashionable middle class), a majority group that uses more general words, a group largely consisting of engineers, and, as outliers, 3 people who included their cover letter within their CV.
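To make the idea concrete, here is a minimal clustering sketch in plain Python. It assumes k-means on raw word counts; the post does not say which algorithm or features were used in Dataiku DSS, so this is an illustration of the principle rather than the actual setup. The four texts are invented, and the starting centroids are fixed by hand so that the run is reproducible, whereas real implementations usually pick them at random.

```python
from collections import Counter


def vectorise(texts):
    """Turn each text into a word-count vector over the shared vocabulary."""
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    return [[c[w] for w in vocab] for c in counts]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented mini "CVs", not taken from the 368 CVs.
texts = [
    "sales negotiation clients revenue",
    "sales clients revenue targets",
    "python engineering software systems",
    "engineering software systems design",
]
vectors = vectorise(texts)
labels = k_means(vectors, initial_centroids=[vectors[0], vectors[2]])
print(labels)
```

Nobody told the algorithm that “sales” or “engineering” were the discriminating words; the grouping falls out of the word counts alone. The cluster numbers themselves carry no meaning, which is why interpretation remains a manual step.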

Clustering results (in French)

With a predictive model (supervised machine learning), the algorithm determines (or learns) the correlations that explain a target variable, and this target variable can then be predicted on another dataset (for example, future candidates).

I didn’t have an interesting variable to predict here (for example: was the candidate offered an interview?). I tried a predictive model with the number of words in a CV as the target. The results were disappointing because the algorithm fitted too closely to particular cases (overfitting), especially the candidates’ place of residence.
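The effect can be reproduced on a toy example. In the sketch below, a “model” simply memorises the average word count per city seen during training. The figures are contrived to make the point and are not taken from the real CVs, and the memorising model is a stand-in for illustration, not the algorithm used in Dataiku DSS.

```python
from statistics import mean

# Contrived (city, word_count) pairs; not taken from the 368 CVs.
train = [("Paris", 310), ("Lyon", 250), ("Lille", 420), ("Nantes", 180)]
test = [("Paris", 220), ("Lyon", 390), ("Lille", 260), ("Nantes", 330)]

# Baseline: always predict the average word count of the training set.
global_mean = mean(count for _, count in train)

# "Model" that memorises the mean word count per city seen in training.
per_city = {}
for city, count in train:
    per_city.setdefault(city, []).append(count)
memorised = {city: mean(counts) for city, counts in per_city.items()}


def predict_memorised(city):
    """Return the memorised value for a known city, else the global mean."""
    return memorised.get(city, global_mean)


def mean_abs_error(rows, predict):
    return mean(abs(predict(city) - count) for city, count in rows)


train_error = mean_abs_error(train, predict_memorised)
test_error = mean_abs_error(test, predict_memorised)
baseline_test_error = mean_abs_error(test, lambda city: global_mean)

print("memorised model, train error:", train_error)
print("memorised model, test error:", test_error)
print("global mean, test error:", baseline_test_error)
```

With one candidate per city, the memorising model is perfect on the training data and worse than the plain average on new candidates. That gap between training and test error is the signature of overfitting.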

In conclusion, data science projects on CVs and recruitment need a large volume of data so that particular cases do not dominate the results.

Also, the business objective of these studies must be defined in advance. There are many possible use cases: filtering CVs before offering an interview, preventing early departures from companies, proposing internal transfers in large groups according to employees’ career paths, etc.

The importance of involving different profiles in these projects (HR professionals for their knowledge of the business, and data scientists for their data science skills) was of course debated, as was the need to maintain an ethical framework. Algorithms digest what we feed them and will therefore reproduce existing discriminatory behaviour.

I thank Mohamed for inviting me to his team’s meetup and proposing this challenge! Of course, I am committed to preserving the anonymity of the data and the CVs. I hope this meetup was educational and opened eyes to the possibilities of data.