Lab 2. Employee Flight Risk

OIDD 2550, Professor Tambe


Learning Objectives for this lab

Data Context

Hyper-attrition has plagued HR management in a number of occupations, from local package delivery drivers to software developers. In the HR analytics space, there is significant interest in building products that help companies predict who is likely to leave and when. Such tools can help considerably with future workforce planning and can let managers take steps ahead of time to retain high performers.

This lab asks you to build a machine learning-based prediction tool and to evaluate its potential for perpetuating algorithmic bias in promotion decisions, a subject we will discuss in more detail in the second half of the semester. For submission, create a new document with your name and your answers to each of the questions in the assignment.

To build the algorithm, you have access to historical employee attrition data from a very large firm. In these data, the labeled column to be predicted is called attrition, and it indicates whether the employee left the company within a specific time window. The purpose of the algorithm is to recommend when managers should intervene to try to retain an employee.

Deliverables

For this assignment, you are encouraged to discuss with others, but please complete and submit the assignment on your own. Please submit a response document (e.g., a Word or Google Doc) with answers to each of the questions below. You do not need to submit the documents or Python notebooks in which you did the work, and you do not need to paste the original questions into your response document. Your answers need not be lengthy: when a question asks for a number, simply reporting the number is fine; when it asks for an explanation, one or two sentences is usually sufficient; and when it asks for a figure, a pasted screenshot is fine. Assignments are submitted through Canvas; see Canvas for the due date.

1. Building and interpreting a machine learning model

Using Excel or any other software you prefer, browse the data to get a sense of it and to learn what we know about each worker. There are many fields, including compensation, demographic attributes, job role, and others. Take a minute to consider how challenging it would be to build an effective model from all these fields "by hand", and how much easier it is to let machine learning techniques discover the best model when given labeled training data (i.e., examples of correct answers). The sketch below shows one way to take this first look in Python.
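A minimal sketch of this first look, assuming the course file is a CSV; the file name attrition.csv and the capitalization of the Attrition column are assumptions, so adjust them to match your download:

```python
import pandas as pd

# "attrition.csv" is a placeholder: use the name of the course-provided file
df = pd.read_csv("attrition.csv")

print(df.shape)                         # number of employees and fields
print(df.dtypes)                        # which fields are numeric vs. categorical
print(df["Attrition"].value_counts())  # how many leavers vs. stayers
```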

Question 1a

Build an algorithm that predicts whether or not an employee is likely to leave (i.e., the attrition variable takes the value 'Yes'). For each of the classifiers in the table below, fill in the table with the algorithm's performance on the test sample. Use a 70-30 split when dividing the data into training and test sets. You do not need to adjust the default classifier settings in scikit-learn. A starter sketch in Python appears below.

Classifier                        Accuracy    Precision    Sensitivity
Logistic regression (functions)
Random Forest (trees)
Naive Bayes (bayes)

In addition, include a confusion matrix and an ROC curve for the logistic regression model.
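A minimal starter sketch, again assuming a file named attrition.csv with an Attrition column coded Yes/No (both names are assumptions; match them to your file). Categorical fields are one-hot encoded so all three scikit-learn classifiers can consume them, and the only non-default argument is a higher iteration cap so the logistic regression solver converges on unscaled data:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             ConfusionMatrixDisplay, RocCurveDisplay)

df = pd.read_csv("attrition.csv")                   # assumed file name
y = (df["Attrition"] == "Yes").astype(int)          # 1 = employee left
X = pd.get_dummies(df.drop(columns=["Attrition"]))  # one-hot encode categoricals

# 70-30 split; fixing random_state keeps every later question on the same test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_iter only raises the solver's iteration cap so it converges;
# the model specification itself stays at its defaults
models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name,
          round(accuracy_score(y_test, pred), 3),
          round(precision_score(y_test, pred), 3),
          round(recall_score(y_test, pred), 3))  # sensitivity = recall

# Confusion matrix and ROC curve for the fitted logistic regression
ConfusionMatrixDisplay.from_estimator(models["Logistic regression"], X_test, y_test)
RocCurveDisplay.from_estimator(models["Logistic regression"], X_test, y_test)
plt.show()
```

Fixing random_state pins the 70-30 split, so the three classifiers here and the cost calculations later are all evaluated on the same test rows.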

Question 1b

As a point of comparison, consider a “do-nothing” strategy (no algorithm), which is equivalent to making the prediction that nobody will leave. What is the accuracy of such a prediction on the test data?
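Continuing the sketch above, the baseline accuracy is just the share of stayers in the test set:

```python
# "Do nothing" predicts 0 (stays) for everyone, so its accuracy
# equals the fraction of test-set employees who actually stayed
baseline_accuracy = (y_test == 0).mean()
print(round(baseline_accuracy, 3))
```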

Question 1c

For the employee flight risk context, explain, in words, what is meant by "true positive", "false positive", "true negative", and "false negative".

2. Costs and Benefits

The point of a machine learning algorithm here is to allow managers to intervene and retain workers who might otherwise have left. Imagine that the principal way you retain such workers is to promote them, and that offering an employee a promotion is 100% effective in retaining them (i.e., it always works).

Question 2a

You estimate that it costs $20,000 to promote an employee (including training plus higher compensation costs) and that, for employees in these data, it costs $100,000 to lose an employee. Based solely on these considerations and the recommendations of the algorithms above, which of the three algorithms would yield the highest financial returns when run on the test sample? Consider, for example, the savings from "true positives" as well as the money wasted promoting "false positives".
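A sketch of the bookkeeping, reusing the fitted models and test split from Question 1a. Each true positive saves the attrition cost minus the promotion cost; each false positive is a wasted promotion:

```python
from sklearn.metrics import confusion_matrix

def net_savings(model, promote_cost=20_000, leave_cost=100_000):
    tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
    # Retained leavers save (leave_cost - promote_cost); needless promotions
    # cost promote_cost each. This quantity is also exactly the saving
    # relative to the do-nothing baseline.
    return tp * (leave_cost - promote_cost) - fp * promote_cost

for name, model in models.items():
    print(name, net_savings(model))
```

Because net_savings is measured relative to the do-nothing baseline, the same numbers bear on Question 2b, and rerunning with promote_cost=50_000 bears on Question 2c.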

Question 2b

How much would you save over a "do-nothing" strategy, which can be expected to cost the employer about $8.1 million (81 employees leaving × $100,000 per lost employee)? Please justify your answer with calculations from the test data.

Question 2c

If the cost of retaining an employee (i.e., the cost of a promotion) had been $50,000 instead of $20,000, would your answer change? Please justify your answer with calculations from the test data.

3. Algorithmic Bias

Question 3a

What is the fraction of women in the overall sample (training + test)?

Question 3b

What is the rate of attrition for men and for women in the overall sample, computed separately for each group? You can compute this using any tool you wish; in Excel, for example, you could filter on each gender category and then compute statistics on the filtered samples. A pandas sketch covering this question and the previous one appears below.
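A pandas sketch covering Questions 3a and 3b, assuming the Gender and Attrition column names used above:

```python
# Fraction of each gender in the overall sample (Question 3a)
print(df["Gender"].value_counts(normalize=True))

# Attrition rate computed separately for each gender (Question 3b)
print(df.groupby("Gender")["Attrition"].apply(lambda s: (s == "Yes").mean()))
```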

Question 3c

According to the Naive Bayes algorithm used above, what fraction of the men and of the women in the test sample receive a promotion? One way to compute this is to download the data into Excel and use PivotTables. (Making this type of computation workflow easier is what companies like Weights & Biases are trying to do.)
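A sketch of one way to compute this in pandas instead, reusing the fitted Naive Bayes model and test split from above: the algorithm "promotes" whoever it predicts will leave, so the selection rate is the rate of positive predictions within each gender.

```python
import pandas as pd

# train_test_split preserves the original row index, so we can look up
# the gender of each test-set employee in the raw data frame
nb_pred = pd.Series(models["Naive Bayes"].predict(X_test), index=X_test.index)
test_gender = df.loc[X_test.index, "Gender"]

promo_rate = nb_pred.groupby(test_gender).mean()  # selection rate by gender
print(promo_rate)
```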

Question 3d

Promotion and employee selection are governed by guidelines set by the Equal Employment Opportunity Commission (EEOC), and algorithms such as these are considered tests for selecting employees for promotion. Please read this summary of the four-fifths rule in the EEOC guidelines carefully. Does the use of your algorithm in the prior question violate employment law? Justify your answer with numerical calculations.
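A sketch of the check, using promo_rate from the previous sketch: under the four-fifths rule, the selection rate of the less-selected group should be at least 80% of the rate of the most-selected group.

```python
# Ratio of the lower selection rate to the higher one; < 0.8 suggests adverse impact
ratio = promo_rate.min() / promo_rate.max()
print(round(ratio, 3))
```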

Question 3e

You are advised that because many variables are correlated with Gender, a proper approach would be to remove Gender as well as its correlates. These correlates include Overtime, NumCompaniesWorked, TotalWorkingYears, JobRole, BusinessTravel, StandardHours, TrainingTimesLastYear, JobLevel, MaritalStatus, and Department. If you remove these from the Naive Bayes model, along with Gender, how do the financial returns of the modified algorithm compare to those of the original Naive Bayes model that includes all variables? A sketch appears below.
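A sketch of the modification, reusing the earlier split so the comparison with the full model is apples-to-apples. The column names follow the question; adjust capitalization to match your file (some versions of this dataset spell it OverTime, for example):

```python
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

drop_cols = ["Gender", "Overtime", "NumCompaniesWorked", "TotalWorkingYears",
             "JobRole", "BusinessTravel", "StandardHours",
             "TrainingTimesLastYear", "JobLevel", "MaritalStatus", "Department"]

X_reduced = pd.get_dummies(df.drop(columns=["Attrition"] + drop_cols))
# Select the same train/test rows as before via the preserved index
Xr_train, Xr_test = X_reduced.loc[X_train.index], X_reduced.loc[X_test.index]

nb_reduced = GaussianNB().fit(Xr_train, y_train)
tn, fp, fn, tp = confusion_matrix(y_test, nb_reduced.predict(Xr_test)).ravel()
print(tp * (100_000 - 20_000) - fp * 20_000)  # compare to the full-model figure
```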

Question 3f

If you were the head of HR for this organization, given what you've learned about the costs, benefits, and biases in the data, what would you recommend to company leadership (e.g., would you use one of the full models from Question 1a, the model without gender-correlated covariates from Question 3e, or no algorithm at all)? There is no right or wrong answer here, but for full credit, provide support for whichever position you take.