[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]
When running a business, it takes money to make money. You make spending decisions every day for the better moment of your company. One of the most important investments you can make is in new people.
-
Hiring takes a lot of skills, patience, time and money.Most of all, small business owners spend around 40% of their working hours on tasks that do not generate income, such as hiring.
-
Companies spend from 15% to 20% of the employee’s annual salary, with more senior positions on the higher end of the scale.
-
It’s known that an average company loses anywhere between 1% and 2.5% of their total revenue on the time it takes to bring a new hire up to speed.
-
Hiring an employee in a company with 0-500 people costs an average of $7,645.
-
The average company in the United States spends about $4,000 to hire a new employee, taking up to 52 days to fill a position.
Our goal is to predict which employees are most likely to quit their job and a dataset has been provided to us.
Sources:
1 – Import the libraries and look at the dataset.
import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt
employee_df=pd.read_csv("Human_Resources.csv") employee_df.head()
Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
- We need to predict the Attrition’s feature.
employee_df.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount', 'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager'], dtype='object')
employee_df.info() # 35 features in total with 1470 data points each
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1470 entries, 0 to 1469 Data columns (total 35 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Age 1470 non-null int64 1 Attrition 1470 non-null object 2 BusinessTravel 1470 non-null object 3 DailyRate 1470 non-null int64 4 Department 1470 non-null object 5 DistanceFromHome 1470 non-null int64 6 Education 1470 non-null int64 7 EducationField 1470 non-null object 8 EmployeeCount 1470 non-null int64 9 EmployeeNumber 1470 non-null int64 10 EnvironmentSatisfaction 1470 non-null int64 11 Gender 1470 non-null object 12 HourlyRate 1470 non-null int64 13 JobInvolvement 1470 non-null int64 14 JobLevel 1470 non-null int64 15 JobRole 1470 non-null object 16 JobSatisfaction 1470 non-null int64 17 MaritalStatus 1470 non-null object 18 MonthlyIncome 1470 non-null int64 19 MonthlyRate 1470 non-null int64 20 NumCompaniesWorked 1470 non-null int64 21 Over18 1470 non-null object 22 OverTime 1470 non-null object 23 PercentSalaryHike 1470 non-null int64 24 PerformanceRating 1470 non-null int64 25 RelationshipSatisfaction 1470 non-null int64 26 StandardHours 1470 non-null int64 27 StockOptionLevel 1470 non-null int64 28 TotalWorkingYears 1470 non-null int64 29 TrainingTimesLastYear 1470 non-null int64 30 WorkLifeBalance 1470 non-null int64 31 YearsAtCompany 1470 non-null int64 32 YearsInCurrentRole 1470 non-null int64 33 YearsSinceLastPromotion 1470 non-null int64 34 YearsWithCurrManager 1470 non-null int64 dtypes: int64(26), object(9) memory usage: 402.1+ KB
employee_df.describe()
Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.0 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | ... | 1470.000000 | 1470.0 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 | 1470.000000 |
mean | 36.923810 | 802.485714 | 9.192517 | 2.912925 | 1.0 | 1024.865306 | 2.721769 | 65.891156 | 2.729932 | 2.063946 | ... | 2.712245 | 80.0 | 0.793878 | 11.279592 | 2.799320 | 2.761224 | 7.008163 | 4.229252 | 2.187755 | 4.123129 |
std | 9.135373 | 403.509100 | 8.106864 | 1.024165 | 0.0 | 602.024335 | 1.093082 | 20.329428 | 0.711561 | 1.106940 | ... | 1.081209 | 0.0 | 0.852077 | 7.780782 | 1.289271 | 0.706476 | 6.126525 | 3.623137 | 3.222430 | 3.568136 |
min | 18.000000 | 102.000000 | 1.000000 | 1.000000 | 1.0 | 1.000000 | 1.000000 | 30.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 80.0 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 30.000000 | 465.000000 | 2.000000 | 2.000000 | 1.0 | 491.250000 | 2.000000 | 48.000000 | 2.000000 | 1.000000 | ... | 2.000000 | 80.0 | 0.000000 | 6.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 |
50% | 36.000000 | 802.000000 | 7.000000 | 3.000000 | 1.0 | 1020.500000 | 3.000000 | 66.000000 | 3.000000 | 2.000000 | ... | 3.000000 | 80.0 | 1.000000 | 10.000000 | 3.000000 | 3.000000 | 5.000000 | 3.000000 | 1.000000 | 3.000000 |
75% | 43.000000 | 1157.000000 | 14.000000 | 4.000000 | 1.0 | 1555.750000 | 4.000000 | 83.750000 | 3.000000 | 3.000000 | ... | 4.000000 | 80.0 | 1.000000 | 15.000000 | 3.000000 | 3.000000 | 9.000000 | 7.000000 | 3.000000 | 7.000000 |
max | 60.000000 | 1499.000000 | 29.000000 | 5.000000 | 1.0 | 2068.000000 | 4.000000 | 100.000000 | 4.000000 | 5.000000 | ... | 4.000000 | 80.0 | 3.000000 | 40.000000 | 6.000000 | 4.000000 | 40.000000 | 18.000000 | 15.000000 | 17.000000 |
- We can observe that the average age of the company is around 37 years and that the average number of years in the company is 7. Later we will go deeper into these and other features.
cat_columns = [cname for cname in employee_df.columns if employee_df[cname].dtype == "object"] for i in cat_columns: if i != 'Date' and i!= 'Time': print("%s has %d elements: %s"%(i, len(employee_df[i].unique().tolist()), employee_df[i].unique().tolist()))
Attrition has 2 elements: ['Yes', 'No'] BusinessTravel has 3 elements: ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel'] Department has 3 elements: ['Sales', 'Research & Development', 'Human Resources'] EducationField has 6 elements: ['Life Sciences', 'Other', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources'] Gender has 2 elements: ['Female', 'Male'] JobRole has 9 elements: ['Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources'] MaritalStatus has 3 elements: ['Single', 'Married', 'Divorced'] Over18 has 1 elements: ['Y'] OverTime has 2 elements: ['Yes', 'No']
2 – Dataset visualization
We replace the ‘Attrition’, ‘Over18’ and ‘overtime’ columns by integers before we can carry out any visualization, because those features are binaries strings (‘Yes/No’)
employee_df['Attrition'] =employee_df['Attrition'].apply(lambda x: 1 if x== 'Yes' else 0) employee_df['Over18'] = employee_df['Over18'].apply(lambda x: 1 if x=='Y' else 0) employee_df['OverTime'] = employee_df['OverTime']. apply(lambda x: 1 if x=='Yes' else 0) employee_df.head()
Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | 1 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 49 | 0 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 37 | 1 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 33 | 0 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 27 | 0 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
– Let’s see if we’re missing data. We’ll use a heatmap:
sns.heatmap(employee_df.isnull(), yticklabels=False, cbar=False, cmap = "Blues")
employee_df.hist(bins=30, figsize=(20,20), color='r')
-
Some features like ‘MonthlyIncome’ and ‘TotalWorkingYears’ have a distribution with a very long tail (long tail distribution)
-
It makes sense to remove ‘EmployeeCount’, ‘Standardhours’ and ‘Over18’, they are fields that don’t change from one employee to another one
-
We also remove ‘EmployeeNumber’, it belongs to a field to identify employees, it doesn’t have any effect in our study.
employee_df.drop(["EmployeeCount", "StandardHours", "Over18", "EmployeeNumber"], axis=1, inplace = True) employee_df.shape
(1470, 31)
– Let’s see how many employees left and stayed in the company:
left_df = employee_df[employee_df['Attrition']== 1] stay_df = employee_df[employee_df['Attrition']== 0] print('Left', round(employee_df['Attrition'].value_counts()[1]/len(employee_df) * 100,2), '% of the dataset') print('Stay', round(employee_df['Attrition'].value_counts()[0]/len(employee_df) * 100,2), '% of the dataset')
Left 16.12 % of the dataset Stay 83.88 % of the dataset
We are facing an unbalanced dataset.
– Now we’re going to compare the employees’ mean and standard error between those who have left and who have stayed:
left_df.describe()
Age | Attrition | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 237.000000 | 237.0 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | ... | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 | 237.000000 |
mean | 33.607595 | 1.0 | 750.362869 | 10.632911 | 2.839662 | 2.464135 | 65.573840 | 2.518987 | 1.637131 | 2.468354 | ... | 3.156118 | 2.599156 | 0.527426 | 8.244726 | 2.624473 | 2.658228 | 5.130802 | 2.902954 | 1.945148 | 2.852321 |
std | 9.689350 | 0.0 | 401.899519 | 8.452525 | 1.008244 | 1.169791 | 20.099958 | 0.773405 | 0.940594 | 1.118058 | ... | 0.363735 | 1.125437 | 0.856361 | 7.169204 | 1.254784 | 0.816453 | 5.949984 | 3.174827 | 3.153077 | 3.143349 |
min | 18.000000 | 1.0 | 103.000000 | 1.000000 | 1.000000 | 1.000000 | 31.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 3.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 28.000000 | 1.0 | 408.000000 | 3.000000 | 2.000000 | 1.000000 | 50.000000 | 2.000000 | 1.000000 | 1.000000 | ... | 3.000000 | 2.000000 | 0.000000 | 3.000000 | 2.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 32.000000 | 1.0 | 699.000000 | 9.000000 | 3.000000 | 3.000000 | 66.000000 | 3.000000 | 1.000000 | 3.000000 | ... | 3.000000 | 3.000000 | 0.000000 | 7.000000 | 2.000000 | 3.000000 | 3.000000 | 2.000000 | 1.000000 | 2.000000 |
75% | 39.000000 | 1.0 | 1092.000000 | 17.000000 | 4.000000 | 4.000000 | 84.000000 | 3.000000 | 2.000000 | 3.000000 | ... | 3.000000 | 4.000000 | 1.000000 | 10.000000 | 3.000000 | 3.000000 | 7.000000 | 4.000000 | 2.000000 | 5.000000 |
max | 58.000000 | 1.0 | 1496.000000 | 29.000000 | 5.000000 | 4.000000 | 100.000000 | 4.000000 | 5.000000 | 4.000000 | ... | 4.000000 | 4.000000 | 3.000000 | 40.000000 | 6.000000 | 4.000000 | 40.000000 | 15.000000 | 15.000000 | 14.000000 |
stay_df.describe()
Age | Attrition | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1233.000000 | 1233.0 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | ... | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 | 1233.000000 |
mean | 37.561233 | 0.0 | 812.504461 | 8.915653 | 2.927007 | 2.771290 | 65.952149 | 2.770479 | 2.145985 | 2.778589 | ... | 3.153285 | 2.733982 | 0.845093 | 11.862936 | 2.832928 | 2.781022 | 7.369019 | 4.484185 | 2.234388 | 4.367397 |
std | 8.888360 | 0.0 | 403.208379 | 8.012633 | 1.027002 | 1.071132 | 20.380754 | 0.692050 | 1.117933 | 1.093277 | ... | 0.360408 | 1.071603 | 0.841985 | 7.760719 | 1.293585 | 0.681907 | 6.096298 | 3.649402 | 3.234762 | 3.594116 |
min | 18.000000 | 0.0 | 102.000000 | 1.000000 | 1.000000 | 1.000000 | 30.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 3.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 31.000000 | 0.0 | 477.000000 | 2.000000 | 2.000000 | 2.000000 | 48.000000 | 2.000000 | 1.000000 | 2.000000 | ... | 3.000000 | 2.000000 | 0.000000 | 6.000000 | 2.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 2.000000 |
50% | 36.000000 | 0.0 | 817.000000 | 7.000000 | 3.000000 | 3.000000 | 66.000000 | 3.000000 | 2.000000 | 3.000000 | ... | 3.000000 | 3.000000 | 1.000000 | 10.000000 | 3.000000 | 3.000000 | 6.000000 | 3.000000 | 1.000000 | 3.000000 |
75% | 43.000000 | 0.0 | 1176.000000 | 13.000000 | 4.000000 | 4.000000 | 83.000000 | 3.000000 | 3.000000 | 4.000000 | ... | 3.000000 | 4.000000 | 1.000000 | 16.000000 | 3.000000 | 3.000000 | 10.000000 | 7.000000 | 3.000000 | 7.000000 |
max | 60.000000 | 0.0 | 1499.000000 | 29.000000 | 5.000000 | 4.000000 | 100.000000 | 4.000000 | 5.000000 | 4.000000 | ... | 4.000000 | 4.000000 | 3.000000 | 38.000000 | 6.000000 | 4.000000 | 37.000000 | 18.000000 | 15.000000 | 17.000000 |
- ‘age’: Employees
average age who stayed is higher than those who left (
37vs.
33`) - ‘DailyRate’: Employees` daily rate who stayed is higher.
- ‘DistanceFromHome’: The employees who stayed in the company live closer to work.
- ‘EnvironmentSatisfaction’ y ‘JobSatisfaction’: Most of the employees who stayed are more satisfied with their jobs.
- ‘StockOptionLevel’: The employees who stayed have a high level of stocks options.
- ‘OverTime’: The employees who left they worked almost double overtime.
Correlations between variables:
correlations = employee_df.corr() f, ax = plt.subplots(figsize = (20,20)) sns.heatmap(correlations, annot=True)
- ‘Job level’ is highly correlated with the total number of working hours.
- ‘Monthly income’ is highly correlated with Job level and with the total number of working hours.
- ‘Age’ is highly correlated with the Monthly income.
We compare the distributions:
Age vs. Attrition
plt.figure(figsize=(25,12)) sns.countplot(x='Age', hue='Attrition', data= employee_df)
- The people are most likely to leave the company between 26 and 35 yeas old.
(JobRole & MaritalStatus & JobInvolvement & JobLevel) vs. Attrition
plt.figure(figsize=(20,20)) plt.subplot(411) sns.countplot(x='JobRole', hue='Attrition', data= employee_df) plt.subplot(412) sns.countplot(x='MaritalStatus', hue='Attrition', data= employee_df) plt.subplot(413) sns.countplot(x='JobInvolvement', hue='Attrition', data= employee_df) plt.subplot(414) sns.countplot(x='JobLevel', hue='Attrition', data= employee_df)
- Single employees tend to leave the company compared to married and divorced employees.
- Sales Representative us the section the most mobility than the others sections.
- Less involved employees tend to leave the company.
- The employees with a low level tend to leave the company.
Probability density estimation:
Distance from Home vs. Attrition
plt.figure(figsize=(12,7)) sns.kdeplot(left_df['DistanceFromHome'], label = 'Leave the company', shade=True, color='r') sns.kdeplot(stay_df['DistanceFromHome'], label = 'Stay in the company', shade=True, color='b') plt.xlabel('Distance from home to work')
- In the 10-28 Km range we can see that it could exist some correlation variable between the employees who left ande who stay in the company.
plt.figure(figsize=(12,7)) sns.kdeplot(left_df['YearsWithCurrManager'], label = 'Leave the company', shade=True, color='r') sns.kdeplot(stay_df['YearsWithCurrManager'], label = 'Stay in the company', shade=True, color='b') plt.xlabel("Current Manager's years")
- The less time they are with the same manager, they tend to leave the company more often than if they have been with the same manager for more years.
plt.figure(figsize=(12,7)) sns.kdeplot(left_df['TotalWorkingYears'], label = 'Leave the company', shade=True, color='r') sns.kdeplot(stay_df['TotalWorkingYears'], label = 'Stay in the company', shade=True, color='b') plt.xlabel('Total working years')
- From 10 years the employees tend to stay.
Gender vs. Monthly Income
plt.figure(figsize=(10,8)) sns.boxplot(x='MonthlyIncome', y='Gender', data=employee_df)
- This company doesn’t have wage inequity by gender.
Monthly Income vs. Job Role
plt.figure(figsize=(10,8)) sns.boxplot(x='MonthlyIncome', y='JobRole', data=employee_df)
- Managers and Research Directors earn more money than the others roles.
- Scientist and Laboratory Technician are poorly paid.
- There’s a lot difference between high positions and middle-low positions.
3 – Test and training dataset
employee_df.head()
Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | 1 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 2 | Female | ... | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 49 | 0 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 3 | Male | ... | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 37 | 1 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 4 | Male | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 33 | 0 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 4 | Female | ... | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 27 | 0 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | Male | ... | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
– Our categories:
X_cat=employee_df[['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']] X_cat
BusinessTravel | Department | EducationField | Gender | JobRole | MaritalStatus | |
---|---|---|---|---|---|---|
0 | Travel_Rarely | Sales | Life Sciences | Female | Sales Executive | Single |
1 | Travel_Frequently | Research & Development | Life Sciences | Male | Research Scientist | Married |
2 | Travel_Rarely | Research & Development | Other | Male | Laboratory Technician | Single |
3 | Travel_Frequently | Research & Development | Life Sciences | Female | Research Scientist | Married |
4 | Travel_Rarely | Research & Development | Medical | Male | Laboratory Technician | Married |
... | ... | ... | ... | ... | ... | ... |
1465 | Travel_Frequently | Research & Development | Medical | Male | Laboratory Technician | Married |
1466 | Travel_Rarely | Research & Development | Medical | Male | Healthcare Representative | Married |
1467 | Travel_Rarely | Research & Development | Life Sciences | Male | Manufacturing Director | Married |
1468 | Travel_Frequently | Sales | Medical | Male | Sales Executive | Married |
1469 | Travel_Rarely | Research & Development | Medical | Male | Laboratory Technician | Married |
from sklearn.preprocessing import OneHotEncoder onehotencoder = OneHotEncoder() X_cat = onehotencoder.fit_transform(X_cat).toarray() X_cat.shape
(1470, 26)
X_cat= pd.DataFrame(X_cat) X_cat
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
3 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1465 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1466 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1467 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1468 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1469 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
– We only take the numerical variables:
numerical_columns = [cname for cname in employee_df.columns if (employee_df[cname].dtype == "int64" and cname!='Attrition')] X_numerical= employee_df[numerical_columns] X_numerical
Age | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | JobSatisfaction | MonthlyIncome | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 41 | 1102 | 1 | 2 | 2 | 94 | 3 | 2 | 4 | 5993 | ... | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 49 | 279 | 8 | 1 | 3 | 61 | 2 | 2 | 2 | 5130 | ... | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 37 | 1373 | 2 | 2 | 4 | 92 | 2 | 1 | 3 | 2090 | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 33 | 1392 | 3 | 4 | 4 | 56 | 3 | 1 | 3 | 2909 | ... | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 27 | 591 | 2 | 1 | 1 | 40 | 3 | 1 | 2 | 3468 | ... | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1465 | 36 | 884 | 23 | 2 | 3 | 41 | 4 | 2 | 4 | 2571 | ... | 3 | 3 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
1466 | 39 | 613 | 6 | 1 | 4 | 42 | 2 | 3 | 1 | 9991 | ... | 3 | 1 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
1467 | 27 | 155 | 4 | 3 | 2 | 87 | 4 | 2 | 2 | 6142 | ... | 4 | 2 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
1468 | 49 | 1023 | 2 | 3 | 4 | 63 | 2 | 2 | 2 | 5390 | ... | 3 | 4 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
1469 | 34 | 628 | 8 | 3 | 2 | 82 | 4 | 2 | 3 | 4404 | ... | 3 | 1 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
– We join categorical and numercial tables (without Attrition, our target):
X_all = pd.concat([X_cat, X_numerical], axis=1) X_all
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 3 | 1 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
1 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 4 | 4 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
2 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 3 | 2 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
3 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 3 | 3 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
4 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 3 | 4 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1465 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 3 | 3 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
1466 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 3 | 1 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
1467 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 4 | 2 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
1468 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 3 | 4 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
1469 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 3 | 1 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
Reescale the variables
from sklearn.preprocessing import MinMaxScaler scaler = MinMaxScaler() X = scaler.fit_transform(X_all) X
array([[0. , 0. , 1. , ..., 0.22222222, 0. , 0.29411765], [0. , 1. , 0. , ..., 0.38888889, 0.06666667, 0.41176471], [0. , 0. , 1. , ..., 0. , 0. , 0. ], ..., [0. , 0. , 1. , ..., 0.11111111, 0. , 0.17647059], [0. , 1. , 0. , ..., 0.33333333, 0. , 0.47058824], [0. , 0. , 1. , ..., 0.16666667, 0.06666667, 0.11764706]])
– Our target variable:
y = employee_df['Attrition'] y
0 1 1 0 2 1 3 0 4 0 .. 1465 0 1466 0 1467 0 1468 0 1469 0 Name: Attrition, Length: 1470, dtype: int64
4 – Training and evaluate a classifier using a Logistic Regression
from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25) #25% of the dataset to train.
X_train.shape
(1102, 50)
X_test.shape
(368, 50)
model = LogisticRegression() model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
– We use test data to predict:
y_pred = model.predict(X_test) y_pred
array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1])
- 1 –> The employee leaves
- 0 –> The employe stay
Let ‘s see our accuracy
from sklearn.metrics import confusion_matrix, classification_report print("Accuracy: {}".format(100*accuracy_score(y_pred, y_test)))
Accuracy: 85.86956521739131
cm = confusion_matrix(y_test, y_pred) sns.heatmap(cm, annot=True)
We get a good accuracy(=88%), but we have to check the rest of the paramenters yet.
We analyze the Precision , Recall and F1-Score:
print(classification_report(y_test, y_pred))
precision recall f1-score support 0 0.87 0.98 0.92 304 1 0.73 0.30 0.42 64 accuracy 0.86 368 macro avg 0.80 0.64 0.67 368 weighted avg 0.84 0.86 0.83 368
We get good results for the 0
category, but for the other one is bad.
5 – Training and evaluate a clasiffier using Random Forest
from sklearn.ensemble import RandomForestClassifier model = RandomForestClassifier() model.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred) sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_pred))
precision recall f1-score support 0 0.84 0.99 0.91 304 1 0.64 0.11 0.19 64 accuracy 0.83 368 macro avg 0.74 0.55 0.55 368 weighted avg 0.80 0.83 0.78 368
We continue to obtain poor results for 1
.
6 – Training and evaluate a clasiffier using Deep Learning
import tensorflow as tf model = tf.keras.models.Sequential() model.add(tf.keras.layers.Dense(units = 500, activation = 'relu', input_shape=(50, ))) model.add(tf.keras.layers.Dense(units = 500, activation = 'relu')) model.add(tf.keras.layers.Dense(units = 500, activation = 'relu')) model.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid')) model.summary()
Model: "sequential_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_4 (Dense) (None, 500) 25500 _________________________________________________________________ dense_5 (Dense) (None, 500) 250500 _________________________________________________________________ dense_6 (Dense) (None, 500) 250500 _________________________________________________________________ dense_7 (Dense) (None, 1) 501 ================================================================= Total params: 527,001 Trainable params: 527,001 Non-trainable params: 0 _________________________________________________________________
model.compile(optimizer='Adam', loss = 'binary_crossentropy', metrics=['accuracy']) # oversampler = SMOTE(random_state=0) # smote_train, smote_target = oversampler.fit_sample(X_train, y_train) # epochs_hist = model.fit(smote_train, smote_target, epochs = 100, batch_size = 50) epochs_hist = model.fit(X_train, y_train, epochs =100, batch_size = 50)
Epoch 1/100 23/23 [==============================] - 0s 6ms/step - loss: 0.4236 - accuracy: 0.8385 Epoch 2/100 23/23 [==============================] - 0s 6ms/step - loss: 0.3598 - accuracy: 0.8666 Epoch 3/100 23/23 [==============================] - 0s 6ms/step - loss: 0.2984 - accuracy: 0.8820 Epoch 4/100 23/23 [==============================] - 0s 9ms/step - loss: 0.2969 - accuracy: 0.8811 Epoch 5/100 23/23 [==============================] - 0s 7ms/step - loss: 0.2527 - accuracy: 0.9111 ................................................................................... Epoch 95/100 23/23 [==============================] - 0s 6ms/step - loss: 2.0451e-06 - accuracy: 1.0000 Epoch 96/100 23/23 [==============================] - 0s 6ms/step - loss: 1.9649e-06 - accuracy: 1.0000 Epoch 97/100 23/23 [==============================] - 0s 5ms/step - loss: 1.8953e-06 - accuracy: 1.0000 Epoch 98/100 23/23 [==============================] - 0s 5ms/step - loss: 1.8310e-06 - accuracy: 1.0000 Epoch 99/100 23/23 [==============================] - 0s 5ms/step - loss: 1.7649e-06 - accuracy: 1.0000 Epoch 100/100 23/23 [==============================] - 0s 5ms/step - loss: 1.7036e-06 - accuracy: 1.0000 Non-trainable params: 0 _________________________________________________________________
y_pred = model.predict(X_test) y_pred
array([[1.02000641e-09], [1.00000000e+00], [2.50559333e-06], [9.55128326e-06], [1.95358858e-08], [1.54894892e-07], [9.45388039e-12], [2.08851762e-14], [1.06263491e-10], [3.86129813e-08], [8.14706087e-04], [7.68147324e-10], [3.01722114e-14], [7.42019329e-05], ...
This is the probability if the employee leaves the company.
– We are going to make a filter. If this indicator is above 0.5% the employee will leave:
y_pred = (y_pred>0.5) y_pred
array([[False], [ True], [False], [False], [False], [False], [False], [False], [False], [False], [False], [False], [False], [False], [False], [False], [False], [ True], [False], [False], [False], ...
epochs_hist.history.keys()
dict_keys(['loss', 'accuracy'])
plt.plot(epochs_hist.history['loss']) plt.title('Función de pérdidas del Modelo durante el entrenamiento') plt.xlabel('Epochs') plt.ylabel('Error de entrenamiento') plt.legend(["Error de entrenamiento"])
plt.plot(epochs_hist.history['accuracy']) plt.title('Tasa de acuerto del Modelo durante el entrenamiento') plt.xlabel('Epochs') plt.ylabel('Accuracy de entrenamiento') plt.legend(["Accuracy de entrenamiento"])
cm =confusion_matrix(y_test, y_pred) sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_pred))
precision recall f1-score support 0 0.87 0.94 0.90 304 1 0.53 0.31 0.39 64 accuracy 0.83 368 macro avg 0.70 0.63 0.65 368 weighted avg 0.81 0.83 0.81 368
Our accuracy is still poor.