Model for HR department

[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]

When running a business, it takes money to make money, and you make spending decisions every day for the good of your company. One of the most important investments you can make is in new people.

  • Hiring takes a lot of skill, patience, time and money. In fact, small business owners spend around 40% of their working hours on tasks that do not generate income, such as hiring.

  • Companies spend from 15% to 20% of an employee’s annual salary on hiring, with more senior positions at the higher end of the scale.

  • It’s known that an average company loses anywhere between 1% and 2.5% of its total revenue on the time it takes to bring a new hire up to speed.

  • Hiring an employee in a company with 0-500 people costs an average of $7,645.

  • The average company in the United States spends about $4,000 to hire a new employee, taking up to 52 days to fill a position.

Our goal is to predict which employees are most likely to quit their job; a dataset has been provided to us for this purpose.

1 – Import the libraries and look at the dataset.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
employee_df = pd.read_csv("Human_Resources.csv")
employee_df.head()
  • We need to predict the ‘Attrition’ feature.
employee_df.columns
Index(['Age', 'Attrition', 'BusinessTravel', 'DailyRate', 'Department',
       'DistanceFromHome', 'Education', 'EducationField', 'EmployeeCount',
       'EmployeeNumber', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
       'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
       'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
       'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
       'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
       'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
       'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
       'YearsWithCurrManager'],
      dtype='object')
employee_df.info()
# 35 features in total with 1470 data points each
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
employee_df.describe()
  • We can observe that the average age of the employees is around 37 years and that the average tenure at the company is 7 years. Later we will go deeper into these and other features.
cat_columns = [cname for cname in employee_df.columns if employee_df[cname].dtype == "object"]

for i in cat_columns:
    print("%s has %d elements: %s" % (i, len(employee_df[i].unique().tolist()), employee_df[i].unique().tolist()))
Attrition has 2 elements: ['Yes', 'No']
BusinessTravel has 3 elements: ['Travel_Rarely', 'Travel_Frequently', 'Non-Travel']
Department has 3 elements: ['Sales', 'Research & Development', 'Human Resources']
EducationField has 6 elements: ['Life Sciences', 'Other', 'Medical', 'Marketing', 'Technical Degree', 'Human Resources']
Gender has 2 elements: ['Female', 'Male']
JobRole has 9 elements: ['Sales Executive', 'Research Scientist', 'Laboratory Technician', 'Manufacturing Director', 'Healthcare Representative', 'Manager', 'Sales Representative', 'Research Director', 'Human Resources']
MaritalStatus has 3 elements: ['Single', 'Married', 'Divorced']
Over18 has 1 elements: ['Y']
OverTime has 2 elements: ['Yes', 'No']

2 – Dataset visualization

Before any visualization, we replace the ‘Attrition’, ‘Over18’ and ‘OverTime’ columns with integers, because those features are binary strings (‘Yes’/‘No’).

employee_df['Attrition'] = employee_df['Attrition'].apply(lambda x: 1 if x == 'Yes' else 0)
employee_df['Over18'] = employee_df['Over18'].apply(lambda x: 1 if x == 'Y' else 0)
employee_df['OverTime'] = employee_df['OverTime'].apply(lambda x: 1 if x == 'Yes' else 0)
employee_df.head()

– Let’s check whether any data is missing. We’ll use a heatmap:

sns.heatmap(employee_df.isnull(), yticklabels=False, cbar=False, cmap="Blues")
employee_df.hist(bins=30, figsize=(20,20), color='r')
  • Some features, such as ‘MonthlyIncome’ and ‘TotalWorkingYears’, have long-tailed distributions.

  • It makes sense to remove ‘EmployeeCount’, ‘StandardHours’ and ‘Over18’: these fields are constant across all employees.

  • We also remove ‘EmployeeNumber’: it is just an employee identifier and has no predictive value for our study.

employee_df.drop(["EmployeeCount", "StandardHours", "Over18", "EmployeeNumber"], axis=1, inplace=True)
employee_df.shape
(1470, 31)

– Let’s see how many employees left and stayed in the company:

left_df = employee_df[employee_df['Attrition'] == 1]
stay_df = employee_df[employee_df['Attrition'] == 0]

print('Left', round(employee_df['Attrition'].value_counts()[1]/len(employee_df) * 100, 2), '% of the dataset')
print('Stay', round(employee_df['Attrition'].value_counts()[0]/len(employee_df) * 100, 2), '% of the dataset')
Left 16.12 % of the dataset
Stay 83.88 % of the dataset

We are facing an unbalanced dataset.
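A dataset this skewed (roughly 84/16) can make a naive classifier look accurate while missing most of the employees who actually leave. One common mitigation, not part of the original course flow and shown here only as a sketch on synthetic data, is to re-weight the classes when fitting:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic imbalanced dataset: 90 'stay' rows, 10 'left' rows.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (90, 3)), rng.normal(1.5, 1.0, (10, 3))])
y_demo = np.array([0] * 90 + [1] * 10)

# class_weight='balanced' re-weights each class inversely to its frequency,
# so the minority class contributes as much to the loss as the majority.
model_demo = LogisticRegression(class_weight='balanced', max_iter=1000)
model_demo.fit(X_demo, y_demo)
minority_recall = model_demo.predict(X_demo)[y_demo == 1].mean()

The same class_weight='balanced' argument could be passed when fitting on the real data later on.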

– Now we compare summary statistics (mean, standard deviation, etc.) between the employees who left and those who stayed:

left_df.describe()
stay_df.describe()
  • ‘Age’: The average age of employees who stayed is higher than that of those who left (37 vs. 33).
  • ‘DailyRate’: The daily rate of employees who stayed is higher.
  • ‘DistanceFromHome’: Employees who stayed live closer to work.
  • ‘EnvironmentSatisfaction’ and ‘JobSatisfaction’: Employees who stayed are, on average, more satisfied with their jobs.
  • ‘StockOptionLevel’: Employees who stayed have a higher stock option level.
  • ‘OverTime’: Employees who left worked almost twice as much overtime.

Correlations between variables:

correlations = employee_df.corr(numeric_only=True)  # numeric_only=True skips the remaining string columns (required in recent pandas)
f, ax = plt.subplots(figsize=(20,20))
sns.heatmap(correlations, annot=True)
  • ‘JobLevel’ is highly correlated with ‘TotalWorkingYears’.
  • ‘MonthlyIncome’ is highly correlated with ‘JobLevel’ and with ‘TotalWorkingYears’.
  • ‘Age’ is highly correlated with ‘MonthlyIncome’.

We compare the distributions:

Age vs. Attrition

plt.figure(figsize=(25,12))
sns.countplot(x='Age', hue='Attrition', data=employee_df)
  • Employees are most likely to leave the company between 26 and 35 years old.

(JobRole & MaritalStatus & JobInvolvement & JobLevel) vs. Attrition

plt.figure(figsize=(20,20))
plt.subplot(411)
sns.countplot(x='JobRole', hue='Attrition', data= employee_df)

plt.subplot(412)
sns.countplot(x='MaritalStatus', hue='Attrition', data= employee_df)

plt.subplot(413)
sns.countplot(x='JobInvolvement', hue='Attrition', data= employee_df)

plt.subplot(414)
sns.countplot(x='JobLevel', hue='Attrition', data= employee_df)
  • Single employees tend to leave the company more than married and divorced ones.
  • Sales Representatives show the highest attrition of all job roles.
  • Less involved employees tend to leave the company.
  • Employees at lower job levels tend to leave the company.

Probability density estimation:

Distance from Home vs. Attrition

plt.figure(figsize=(12,7))
sns.kdeplot(left_df['DistanceFromHome'], label='Leave the company', fill=True, color='r')  # fill= replaces the deprecated shade=
sns.kdeplot(stay_df['DistanceFromHome'], label='Stay in the company', fill=True, color='b')

plt.xlabel('Distance from home to work')
plt.legend()
  • In the 10-28 km range, the density of employees who left is noticeably higher than that of those who stayed, suggesting that distance from home is related to attrition.
plt.figure(figsize=(12,7))
sns.kdeplot(left_df['YearsWithCurrManager'], label='Leave the company', fill=True, color='r')  # fill= replaces the deprecated shade=
sns.kdeplot(stay_df['YearsWithCurrManager'], label='Stay in the company', fill=True, color='b')

plt.xlabel('Years with current manager')
plt.legend()
  • Employees who have spent less time with their current manager tend to leave more often than those who have been with the same manager for years.
plt.figure(figsize=(12,7))
sns.kdeplot(left_df['TotalWorkingYears'], label='Leave the company', fill=True, color='r')  # fill= replaces the deprecated shade=
sns.kdeplot(stay_df['TotalWorkingYears'], label='Stay in the company', fill=True, color='b')

plt.xlabel('Total working years')
plt.legend()
  • From 10 total working years onward, employees tend to stay.

Gender vs. Monthly Income

plt.figure(figsize=(10,8))
sns.boxplot(x='MonthlyIncome', y='Gender', data=employee_df)
  • Monthly income distributions are similar for both genders; this company shows no apparent gender pay gap.

Monthly Income vs. Job Role

plt.figure(figsize=(10,8))
sns.boxplot(x='MonthlyIncome', y='JobRole', data=employee_df)
  • Managers and Research Directors earn more than the other roles.
  • Research Scientists and Laboratory Technicians are the lowest-paid roles.
  • There is a large pay gap between the senior positions and the middle-to-low ones.

3 – Training and test datasets

employee_df.head()

– Our categorical features:

X_cat = employee_df[['BusinessTravel', 'Department', 'EducationField', 'Gender', 'JobRole', 'MaritalStatus']]
X_cat
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder()
X_cat = onehotencoder.fit_transform(X_cat).toarray()
X_cat.shape
(1470, 26)
  • The 26 columns are the one-hot categories: 3 (BusinessTravel) + 3 (Department) + 6 (EducationField) + 2 (Gender) + 9 (JobRole) + 3 (MaritalStatus) = 26.
X_cat = pd.DataFrame(X_cat)
X_cat

– We only take the numerical variables:

numerical_columns = [cname for cname in employee_df.columns if (employee_df[cname].dtype == "int64" and cname != 'Attrition')]
X_numerical = employee_df[numerical_columns]
X_numerical

– We join the categorical and numerical tables (without ‘Attrition’, our target):

X_all = pd.concat([X_cat, X_numerical], axis=1)
X_all

Rescale the variables

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X = scaler.fit_transform(X_all)
X
array([[0.        , 0.        , 1.        , ..., 0.22222222, 0.        ,
        0.29411765],
       [0.        , 1.        , 0.        , ..., 0.38888889, 0.06666667,
        0.41176471],
       [0.        , 0.        , 1.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 1.        , ..., 0.11111111, 0.        ,
        0.17647059],
       [0.        , 1.        , 0.        , ..., 0.33333333, 0.        ,
        0.47058824],
       [0.        , 0.        , 1.        , ..., 0.16666667, 0.06666667,
        0.11764706]])
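A side note on methodology: here the scaler is fitted on the full dataset before the train/test split, so the test rows influence the min/max used for scaling. A leak-free variant, sketched below on toy data rather than the course's original flow, fits MinMaxScaler on the training split only:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for X_all / y, just to keep the sketch self-contained.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

scaler = MinMaxScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn min/max from the training rows only...
X_te_scaled = scaler.transform(X_te)      # ...then apply the same transform to the test rows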

– Our target variable:

y = employee_df['Attrition']
y
0       1
1       0
2       1
3       0
4       0
       ..
1465    0
1466    0
1467    0
1468    0
1469    0
Name: Attrition, Length: 1470, dtype: int64

4 – Training and evaluating a classifier using Logistic Regression

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# 25% of the dataset is held out for testing.
X_train.shape
(1102, 50)
X_test.shape
(368, 50)
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

– We use the test data to predict:

y_pred = model.predict(X_test)
y_pred
array([0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1])
  • 1 –> The employee leaves
  • 0 –> The employee stays

Let's look at our accuracy:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print("Accuracy: {}".format(100*accuracy_score(y_test, y_pred)))
Accuracy: 85.86956521739131
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)

We get a good overall accuracy (~86%), but we still have to check the rest of the metrics.

We analyze the precision, recall and F1-score:

print(classification_report(y_test, y_pred))
    precision    recall  f1-score   support

           0       0.87      0.98      0.92       304
           1       0.73      0.30      0.42        64

    accuracy                           0.86       368
   macro avg       0.80      0.64      0.67       368
weighted avg       0.84      0.86      0.83       368

We get good results for class 0, but poor ones for class 1 (the employees who leave).
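The class-1 numbers in the report come straight from the confusion matrix. A small sketch with hypothetical counts chosen to roughly match the figures above (304 stayers, 64 leavers):

```python
import numpy as np

# Hypothetical confusion matrix [[TN, FP], [FN, TP]],
# roughly matching the classification report above
cm = np.array([[297, 7],
               [45, 19]])
TN, FP = cm[0]
FN, TP = cm[1]

precision_1 = TP / (TP + FP)  # of those predicted to leave, how many actually left
recall_1 = TP / (TP + FN)     # of those who actually left, how many we caught

print(round(precision_1, 2), round(recall_1, 2))  # 0.73 0.3
```

The low recall is the real problem for HR: a model that misses roughly 70% of the employees who actually quit is of limited use for retention, however good its overall accuracy looks.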

5 – Training and evaluating a classifier using Random Forest

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.84      0.99      0.91       304
           1       0.64      0.11      0.19        64

    accuracy                           0.83       368
   macro avg       0.74      0.55      0.55       368
weighted avg       0.80      0.83      0.78       368

We still obtain poor results for class 1.
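One cheap mitigation to try before resampling is to reweight the classes so that errors on leavers cost more. A sketch on synthetic imbalanced data (the notebook would pass the real `X_train`, `y_train` instead):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for the HR features (~84% class 0)
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.84], random_state=0)

# class_weight='balanced' scales sample weights inversely to class frequency,
# penalizing mistakes on the minority class more heavily
model = RandomForestClassifier(class_weight='balanced', random_state=0)
model.fit(X, y)
print(model.score(X, y))
```

This is not a silver bullet — reweighting often trades some precision for recall on class 1 — but it costs one keyword argument and gives a baseline to compare oversampling against.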

6 – Training and evaluating a classifier using Deep Learning

import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(units = 500, activation = 'relu', input_shape=(50, )))
model.add(tf.keras.layers.Dense(units = 500, activation = 'relu'))
model.add(tf.keras.layers.Dense(units = 500, activation = 'relu'))
model.add(tf.keras.layers.Dense(units = 1, activation = 'sigmoid'))

model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_4 (Dense)              (None, 500)               25500     
_________________________________________________________________
dense_5 (Dense)              (None, 500)               250500    
_________________________________________________________________
dense_6 (Dense)              (None, 500)               250500    
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 501       
=================================================================
Total params: 527,001
Trainable params: 527,001
Non-trainable params: 0
_________________________________________________________________
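The parameter counts in the summary can be verified by hand: a Dense layer with n inputs and m units has (n + 1) × m parameters, the +1 being the bias term.

```python
# Dense layer parameters = (inputs + 1) * units  (the +1 is the bias term)
layer_params = [(50, 500), (500, 500), (500, 500), (500, 1)]
counts = [(n_in + 1) * units for n_in, units in layer_params]
print(counts, sum(counts))  # [25500, 250500, 250500, 501] 527001
```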
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

# oversampler = SMOTE(random_state=0)
# smote_train, smote_target = oversampler.fit_sample(X_train, y_train)
# epochs_hist = model.fit(smote_train, smote_target, epochs=100, batch_size=50)
epochs_hist = model.fit(X_train, y_train, epochs=100, batch_size=50)
Epoch 1/100
23/23 [==============================] - 0s 6ms/step - loss: 0.4236 - accuracy: 0.8385
Epoch 2/100
23/23 [==============================] - 0s 6ms/step - loss: 0.3598 - accuracy: 0.8666
Epoch 3/100
23/23 [==============================] - 0s 6ms/step - loss: 0.2984 - accuracy: 0.8820
Epoch 4/100
23/23 [==============================] - 0s 9ms/step - loss: 0.2969 - accuracy: 0.8811
Epoch 5/100
23/23 [==============================] - 0s 7ms/step - loss: 0.2527 - accuracy: 0.9111
...................................................................................
Epoch 95/100
23/23 [==============================] - 0s 6ms/step - loss: 2.0451e-06 - accuracy: 1.0000
Epoch 96/100
23/23 [==============================] - 0s 6ms/step - loss: 1.9649e-06 - accuracy: 1.0000
Epoch 97/100
23/23 [==============================] - 0s 5ms/step - loss: 1.8953e-06 - accuracy: 1.0000
Epoch 98/100
23/23 [==============================] - 0s 5ms/step - loss: 1.8310e-06 - accuracy: 1.0000
Epoch 99/100
23/23 [==============================] - 0s 5ms/step - loss: 1.7649e-06 - accuracy: 1.0000
Epoch 100/100
23/23 [==============================] - 0s 5ms/step - loss: 1.7036e-06 - accuracy: 1.0000
y_pred = model.predict(X_test)
y_pred
array([[1.02000641e-09],
       [1.00000000e+00],
       [2.50559333e-06],
       [9.55128326e-06],
       [1.95358858e-08],
       [1.54894892e-07],
       [9.45388039e-12],
       [2.08851762e-14],
       [1.06263491e-10],
       [3.86129813e-08],
       [8.14706087e-04],
       [7.68147324e-10],
       [3.01722114e-14],
       [7.42019329e-05],
       ...

These are the predicted probabilities that each employee leaves the company.

– We apply a threshold: if the predicted probability is above 0.5, we classify the employee as leaving:

y_pred = (y_pred>0.5)
y_pred
array([[False],
       [ True],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
        ...
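The 0.5 cutoff is a choice, not a given: lowering it flags more potential leavers (higher recall on class 1) at the cost of more false alarms. A sketch with hypothetical sigmoid outputs standing in for `model.predict(X_test)`:

```python
import numpy as np

# Hypothetical predicted probabilities, one per employee
probs = np.array([[0.04], [0.97], [0.35], [0.61], [0.49]])

# Compare two cutoffs; ravel() flattens the (n, 1) output to a 1-D vector
preds = {t: (probs > t).ravel().astype(int) for t in (0.5, 0.3)}
print(preds[0.5])  # [0 1 0 1 0]
print(preds[0.3])  # [0 1 1 1 1] -- the lower cutoff flags two more employees
```

For an HR use case, where missing a leaver is usually costlier than a false alarm, sweeping the threshold against the precision-recall trade-off is worth the extra few lines.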
epochs_hist.history.keys()
dict_keys(['loss', 'accuracy'])
plt.plot(epochs_hist.history['loss'])
plt.title('Model loss during training')
plt.xlabel('Epochs')
plt.ylabel('Training loss')
plt.legend(["Training loss"])
plt.plot(epochs_hist.history['accuracy'])
plt.title('Model accuracy during training')
plt.xlabel('Epochs')
plt.ylabel('Training accuracy')
plt.legend(["Training accuracy"])
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.87      0.94      0.90       304
           1       0.53      0.31      0.39        64

    accuracy                           0.83       368
   macro avg       0.70      0.63      0.65       368
weighted avg       0.81      0.83      0.81       368

The results for class 1 are still poor. The training log above shows the network reaching 100% training accuracy with a near-zero loss — a clear sign of overfitting — and the class imbalance remains unaddressed.
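The commented-out SMOTE lines in the compile/fit cell point to the fix: rebalance the training set before fitting. SMOTE (from the `imbalanced-learn` package) synthesizes new minority samples by interpolation; as a dependency-free illustration of the same idea, here is naive random oversampling with scikit-learn on toy data:

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced training set standing in for X_train / y_train
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 16 + [1] * 4)

X_maj, X_min = X[y == 0], X[y == 1]
# Duplicate minority rows (with replacement) until both classes are the same size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
print(np.bincount(y_bal))  # [16 16]
```

Only the training split should be resampled — the test set must keep the true class distribution, or the reported metrics stop reflecting real-world performance.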