A Model for the HR Department

[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]

When running a business, it takes money to make money. You make spending decisions every day for the betterment of your company, and one of the most important investments you can make is in new people.

  • Hiring takes a lot of skill, patience, time and money. On top of that, small business owners spend around 40% of their working hours on tasks that do not generate income, such as hiring.

  • Recruiting costs run from 15% to 20% of the employee’s annual salary, with more senior positions at the higher end of the scale.

  • An average company loses anywhere between 1% and 2.5% of its total revenue over the time it takes to bring a new hire up to speed.

  • Hiring an employee in a company with 0-500 people costs an average of $7,645.

  • The average company in the United States spends about $4,000 to hire a new employee, taking up to 52 days to fill a position.

Our goal is to predict which employees are most likely to quit their job, using the dataset that has been provided to us.

1 – Import the libraries and look at the dataset.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
employee_df = pd.read_csv("Human_Resources.csv")
employee_df.head()
(Output: the first five rows of the DataFrame — 35 columns, from ‘Age’, ‘Attrition’ and ‘BusinessTravel’ through ‘YearsWithCurrManager’.)
  • We need to predict the ‘Attrition’ feature.
employee_df.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
employee_df.info()
# 35 features in total with 1470 data points each
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
employee_df.describe()
(Output: descriptive statistics — count, mean, std, min, quartiles and max — for the 26 numerical columns.)
  • We can observe that the average employee age is around 37 and that the average tenure at the company is about 7 years. Later we will dig deeper into these and other features.
cat_columns = [cname for cname in employee_df.columns if employee_df[cname].dtype == "object"]

for col in cat_columns:
    values = employee_df[col].unique().tolist()
    print("%s has %d elements: %s" % (col, len(values), values))
Attrition has 2 elements: ['Yes', 'No']
BusinessTravel has 3 elements: ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel']
Department has 3 elements: ['Sales', 'Research & Development', 'Human Resources']
EducationField has 6 elements: ['Life Sciences', 'Other', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources']
Gender has 2 elements: ['Female', 'Male']
JobRole has 9 elements: ['Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources']
MaritalStatus has 3 elements: ['Single', 'Married', 'Divorced']
Over18 has 1 elements: ['Y']
OverTime has 2 elements: ['Yes', 'No']

2 – Dataset visualization

Before carrying out any visualization, we replace the ‘Attrition’, ‘Over18’ and ‘OverTime’ columns with integers, because those features are binary strings (‘Yes’/‘No’):

employee_df['Attrition'] = employee_df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)
employee_df['Over18'] = employee_df['Over18'].apply(lambda x: 1 if x == 'Y' else 0)
employee_df['OverTime'] = employee_df['OverTime'].apply(lambda x: 1 if x == 'Yes' else 0)
employee_df.head()
(Output: the first five rows again; ‘Attrition’, ‘Over18’ and ‘OverTime’ now hold 0/1 integers.)

– Let’s see if we’re missing data. We’ll use a heatmap:

sns.heatmap(employee_df.isnull(), yticklabels=False, cbar=False, cmap="Blues")

– The heatmap is empty: no values are missing. Next, histograms of every feature:

employee_df.hist(bins=30, figsize=(20,20), color='r')
  • Some features, like ‘MonthlyIncome’ and ‘TotalWorkingYears’, have heavily right-skewed (long-tailed) distributions.

  • It makes sense to remove ‘EmployeeCount’, ‘StandardHours’ and ‘Over18’: these fields do not change from one employee to another.

  • We also remove ‘EmployeeNumber’: it is just an identifier and has no effect on our study.

employee_df.drop(["EmployeeCount", "StandardHours", "Over18", "EmployeeNumber"], axis=1, inplace=True)
employee_df.shape
(1470, 31)

– Let’s see how many employees left and stayed in the company:

left_df = employee_df[employee_df['Attrition']== 1]
stay_df = employee_df[employee_df['Attrition']== 0]

print('Left', round(employee_df['Attrition'].value_counts()[1]/len(employee_df) * 100,2), '% of the dataset')
print('Stay', round(employee_df['Attrition'].value_counts()[0]/len(employee_df) * 100,2), '% of the dataset')
Left 16.12 % of the dataset
Stay 83.88 % of the dataset

We are facing an imbalanced dataset.
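Only about 16% of the rows belong to the positive class. As a quick, hedged sketch (not part of the original notebook), scikit-learn can turn this imbalance into per-class weights that class_weight-aware models accept:

from sklearn.utils.class_weight import compute_class_weight

# 'balanced' makes weights inversely proportional to class frequency
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]),
                               y=employee_df['Attrition'])
print(dict(zip([0, 1], weights)))   # roughly {0: 0.60, 1: 3.10}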

– Now we compare the summary statistics of the employees who left with those of the employees who stayed:

left_df.describe()
(Output: descriptive statistics for the 237 employees who left.)
stay_df.describe()
(Output: descriptive statistics for the 1233 employees who stayed.)
  • ‘Age’: the average age of the employees who stayed is higher than of those who left (37 vs. 33).
  • ‘DailyRate’: the employees who stayed have a higher daily rate.
  • ‘DistanceFromHome’: the employees who stayed live closer to work.
  • ‘EnvironmentSatisfaction’ and ‘JobSatisfaction’: the employees who stayed are, on average, more satisfied with their jobs.
  • ‘StockOptionLevel’: the employees who stayed have a higher stock option level.
  • ‘OverTime’: the employees who left worked almost twice as much overtime. A quick numeric cross-check follows below.
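As a sketch (the feature list here is our own choice), we can compare the group means directly:

features = ['Age', 'DailyRate', 'DistanceFromHome', 'StockOptionLevel', 'OverTime']
print(pd.DataFrame({'left': left_df[features].mean(),
                    'stayed': stay_df[features].mean()}))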

Correlations between variables:

correlations = employee_df.corr(numeric_only=True)  # skip the remaining string columns (pandas >= 1.5)
f, ax = plt.subplots(figsize = (20,20))
sns.heatmap(correlations, annot=True)
  • ‘JobLevel’ is highly correlated with ‘TotalWorkingYears’.
  • ‘MonthlyIncome’ is highly correlated with ‘JobLevel’ and with ‘TotalWorkingYears’.
  • ‘Age’ is highly correlated with ‘MonthlyIncome’.
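The full heatmap is dense; here is a minimal sketch to rank the features most correlated with the target directly (sort_values with key= needs pandas 1.1 or later):

attrition_corr = correlations['Attrition'].drop('Attrition')
print(attrition_corr.sort_values(key=abs, ascending=False).head(10))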

We compare the distributions:

Age vs. Attrition

plt.figure(figsize=(25,12))
sns.countplot(x='Age', hue='Attrition', data= employee_df)
  • Employees are most likely to leave the company between 26 and 35 years old.

(JobRole & MaritalStatus & JobInvolvement & JobLevel) vs. Attrition

plt.figure(figsize=(20,20))
plt.subplot(411)
sns.countplot(x='JobRole', hue='Attrition', data= employee_df)

plt.subplot(412)
sns.countplot(x='MaritalStatus', hue='Attrition', data= employee_df)

plt.subplot(413)
sns.countplot(x='JobInvolvement', hue='Attrition', data= employee_df)

plt.subplot(414)
sns.countplot(x='JobLevel', hue='Attrition', data= employee_df)
  • Single employees tend to leave the company more than married and divorced employees.
  • The Sales Representative role shows more turnover than the other roles.
  • Less involved employees tend to leave the company.
  • Employees at the lowest job levels tend to leave the company.

Probability density estimation:

Distance from Home vs. Attrition

plt.figure(figsize=(12,7))
sns.kdeplot(left_df['DistanceFromHome'], label = 'Leave the company', shade=True, color='r')
sns.kdeplot(stay_df['DistanceFromHome'], label = 'Stay in the company', shade=True, color='b')

plt.xlabel('Distance from home to work')
  • In the 10–28 km range, the density for employees who left sits above the density for those who stayed, suggesting that living farther from work is associated with attrition.
plt.figure(figsize=(12,7))
sns.kdeplot(left_df['YearsWithCurrManager'], label = 'Leave the company', shade=True, color='r')
sns.kdeplot(stay_df['YearsWithCurrManager'], label = 'Stay in the company', shade=True, color='b')

plt.xlabel("Current Manager's years")
  • Employees who have spent less time with their current manager leave more often than those who have worked with the same manager for several years.
plt.figure(figsize=(12,7))
sns.kdeplot(left_df['TotalWorkingYears'], label = 'Leave the company', shade=True, color='r')
sns.kdeplot(stay_df['TotalWorkingYears'], label = 'Stay in the company', shade=True, color='b')

plt.xlabel('Total working years')
  • Beyond roughly 10 total working years, employees tend to stay.

Gender vs. Monthly Income

plt.figure(figsize=(10,8))
sns.boxplot(x='MonthlyIncome', y='Gender', data=employee_df)
  • The income distributions for both genders are very similar: this company shows no apparent wage inequity by gender.

Monthly Income vs. Job Role

plt.figure(figsize=(10,8))
sns.boxplot(x='MonthlyIncome', y='JobRole', data=employee_df)
  • Managers and Research Directors earn more than the other roles.
  • Research Scientists and Laboratory Technicians are the lowest paid.
  • There is a large gap between senior positions and middle/low positions, as the quick check below confirms.
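A numeric confirmation of the pay gap between roles (a sketch):

print(employee_df.groupby('JobRole')['MonthlyIncome'].median().sort_values(ascending=False))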

3 – Training and test datasets

employee_df.head()
(Output: the first five rows of the cleaned DataFrame, now 31 columns wide.)

– Our categorical features:

X_cat=employee_df[['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']]
X_cat
(Output: 1470 rows × 6 columns — BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus.)
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder()
X_cat = onehotencoder.fit_transform(X_cat).toarray()
X_cat.shape
(1470, 26)
X_cat = pd.DataFrame(X_cat)
X_cat
(Output: 1470 rows × 26 columns of 0.0/1.0 dummy variables, labeled 0 through 25.)
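The dummy columns lose their original names. If readable labels are wanted, here is a sketch assuming scikit-learn 1.0 or later (where get_feature_names_out exists):

# Purely cosmetic; the rest of the pipeline works either way
X_cat.columns = onehotencoder.get_feature_names_out()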

– We take only the numerical variables:

numerical_columns = [cname for cname in employee_df.columns if (employee_df[cname].dtype == "int64" and cname!='Attrition')]
X_numerical= employee_df[numerical_columns]
X_numerical
(Output: 1470 rows × 24 numerical columns, from ‘Age’ through ‘YearsWithCurrManager’.)

– We join the categorical and numerical tables (without ‘Attrition’, our target):

X_all = pd.concat([X_cat, X_numerical], axis=1)
X_all
(Output: 1470 rows × 50 columns — the 26 dummy columns followed by the 24 numerical features.)

Rescale the variables:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = scaler.fit_transform(X_all)
X
array([[0.        , 0.        , 1.        , ..., 0.22222222, 0.        ,
        0.29411765],
       [0.        , 1.        , 0.        , ..., 0.38888889, 0.06666667,
        0.41176471],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 1.        , ..., 0.11111111, 0.        ,
        0.17647059],
       [0.        , 1.        , 0.        , ..., 0.33333333, 0.        ,
        0.47058824],
       [0.        , 0.        , 1.        , ..., 0.16666667, 0.06666667,
        0.11764706]])

– Our target variable:

y = employee_df['Attrition']
y
0       1
1       0
2       1
3       0
4       0
       ..
1465    0
1466    0
1467    0
1468    0
1469    0
Name: Attrition, Length: 1470, dtype: int64

4 – Training and evaluating a classifier with Logistic Regression

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# 25% of the dataset is held out for testing.
X_train.shape
(1102, 50)
X_test.shape
(368, 50)
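One caveat, flagged here as an aside: the MinMaxScaler above was fitted on the whole dataset before splitting, which lets test-set statistics leak into training. A minimal sketch of the stricter order (split first, then fit the scaler on the training split only; variable names here are our own):

X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_all, y, test_size=0.25)
scaler = MinMaxScaler().fit(X_train_raw)   # min/max computed on the training split only
X_train = scaler.transform(X_train_raw)
X_test = scaler.transform(X_test_raw)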
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

– We use test data to predict:

y_pred = model.predict(X_test)
y_pred
array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1])
  • 1 –> the employee leaves
  • 0 –> the employee stays

Let’s check our accuracy:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Accuracy: {}".format(100*accuracy_score(y_test, y_pred)))
Accuracy: 85.86956521739131
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)

We get a good overall accuracy (≈86%), but we still have to check the other metrics.

We analyze the precision, recall and F1-score:

print(classification_report(y_test, y_pred))
    precision    recall  f1-score   support

           0       0.87      0.98      0.92       304
           1       0.73      0.30      0.42        64

    accuracy                           0.86       368
   macro avg       0.80      0.64      0.67       368
weighted avg       0.84      0.86      0.83       368

We get good results for class 0, but poor results for class 1, the minority class.
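One common mitigation, shown here as a hedged sketch rather than part of the original notebook, is to re-weight the classes inside the model:

balanced_model = LogisticRegression(class_weight='balanced', max_iter=1000)
balanced_model.fit(X_train, y_train)
print(classification_report(y_test, balanced_model.predict(X_test)))
# Usually trades a little overall accuracy for much better recall on class 1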

5 – Training and evaluating a classifier with Random Forest

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.84      0.99      0.91       304
           1       0.64      0.11      0.19        64

    accuracy                           0.83       368
   macro avg       0.74      0.55      0.55       368
weighted avg       0.80      0.83      0.78       368

We continue to obtain poor results for class 1.
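The forest can at least tell us which features drive its decisions. Here is a sketch using the standard feature_importances_ attribute (pairing it with X_all’s column labels is our own addition):

importances = pd.Series(model.feature_importances_, index=X_all.columns)
print(importances.sort_values(ascending=False).head(10))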

6 – Training and evaluating a classifier with Deep Learning

import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units = 500, activation = 'relu', input_shape=(50, )))
model.add(tf.keras.layers.Dense(units = 500, activation = 'relu'))
model.add(tf.keras.layers.Dense(units = 500, activation = 'relu'))
model.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid'))

model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_4 (Dense)              (None, 500)               25500     
_________________________________________________________________
dense_5 (Dense)              (None, 500)               250500    
_________________________________________________________________
dense_6 (Dense)              (None, 500)               250500    
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 501       
=================================================================
Total params: 527,001
Trainable params: 527,001
Non-trainable params: 0
_________________________________________________________________
model.compile(optimizer='Adam', loss = 'binary_crossentropy', metrics=['accuracy'])

# oversampler = SMOTE(random_state=0)   # requires: from imblearn.over_sampling import SMOTE
# smote_train, smote_target = oversampler.fit_resample(X_train, y_train)   # fit_resample is the current API name
# epochs_hist = model.fit(smote_train, smote_target, epochs=100, batch_size=50)
epochs_hist = model.fit(X_train, y_train, epochs=100, batch_size=50)
Epoch 1/100
23/23 [==============================] - 0s 6ms/step - loss: 0.4236 - accuracy: 0.8385
Epoch 2/100
23/23 [==============================] - 0s 6ms/step - loss: 0.3598 - accuracy: 0.8666
Epoch 3/100
23/23 [==============================] - 0s 6ms/step - loss: 0.2984 - accuracy: 0.8820
Epoch 4/100
23/23 [==============================] - 0s 9ms/step - loss: 0.2969 - accuracy: 0.8811
Epoch 5/100
23/23 [==============================] - 0s 7ms/step - loss: 0.2527 - accuracy: 0.9111
...................................................................................
Epoch 95/100
23/23 [==============================] - 0s 6ms/step - loss: 2.0451e-06 - accuracy: 1.0000
Epoch 96/100
23/23 [==============================] - 0s 6ms/step - loss: 1.9649e-06 - accuracy: 1.0000
Epoch 97/100
23/23 [==============================] - 0s 5ms/step - loss: 1.8953e-06 - accuracy: 1.0000
Epoch 98/100
23/23 [==============================] - 0s 5ms/step - loss: 1.8310e-06 - accuracy: 1.0000
Epoch 99/100
23/23 [==============================] - 0s 5ms/step - loss: 1.7649e-06 - accuracy: 1.0000
Epoch 100/100
23/23 [==============================] - 0s 5ms/step - loss: 1.7036e-06 - accuracy: 1.0000
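By the final epochs the training loss has collapsed to around 1e-6 with 100% training accuracy, a clear sign of overfitting. A hedged sketch of one countermeasure, using standard Keras callbacks to hold out a validation split and stop early (not part of the original run):

early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5,
                                              restore_best_weights=True)
epochs_hist = model.fit(X_train, y_train, epochs=100, batch_size=50,
                        validation_split=0.2, callbacks=[early_stop])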
y_pred = model.predict(X_test)
y_pred
array([[1.02000641e-09],
       [1.00000000e+00],
       [2.50559333e-06],
       [9.55128326e-06],
       [1.95358858e-08],
       [1.54894892e-07],
       [9.45388039e-12],
       [2.08851762e-14],
       [1.06263491e-10],
       [3.86129813e-08],
       [8.14706087e-04],
       [7.68147324e-10],
       [3.01722114e-14],
       [7.42019329e-05],
       ...

These are the predicted probabilities that each employee leaves the company.

– We now apply a threshold: if the predicted probability is above 0.5, we classify the employee as leaving:

y_pred = (y_pred>0.5)
y_pred
array([[False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
        ...
epochs_hist.history.keys()
dict_keys(['loss', 'accuracy'])
plt.plot(epochs_hist.history['loss'])
plt.title('Model loss during training')
plt.xlabel('Epochs')
plt.ylabel('Training loss')
plt.legend(['Training loss'])

plt.plot(epochs_hist.history['accuracy'])
plt.title('Model accuracy during training')
plt.xlabel('Epochs')
plt.ylabel('Training accuracy')
plt.legend(['Training accuracy'])
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.87      0.94      0.90       304
           1       0.53      0.31      0.39        64

    accuracy                           0.83       368
   macro avg       0.70      0.63      0.65       368
weighted avg       0.81      0.83      0.81       368

Overall accuracy is similar, but recall for the minority class (the employees who leave) remains poor: the class imbalance hurts all three models, so addressing it is the natural next step.
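A hedged sketch of that next step, mirroring the commented-out SMOTE lines from section 6 (assumes the imbalanced-learn package is installed; fit_resample is its current API):

from imblearn.over_sampling import SMOTE

# Oversample the training split only, so the test set stays untouched
oversampler = SMOTE(random_state=0)
X_train_sm, y_train_sm = oversampler.fit_resample(X_train, y_train)
epochs_hist = model.fit(X_train_sm, y_train_sm, epochs=100, batch_size=50)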