We work as car salesmen and need to develop a model that predicts the total dollar amount a customer is willing to pay for a car. The dataset has the following columns:
- Customer Name
- Customer e-mail
- Country
- Gender
- Age
- Annual Salary
- Credit Card Debt
- Net Worth
Our target variable (the value we want to predict) is Car Purchase Amount.
Import libraries and dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('Car_Purchasing_Data.csv', encoding='ISO-8859-1')
df.head()
| | Customer Name | Customer e-mail | Country | Gender | Age | Annual Salary | Credit Card Debt | Net Worth | Car Purchase Amount |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Martina Avila | cubilia.Curae.Phasellus@quisaccumsanconvallis.edu | Bulgaria | 0 | 41.851720 | 62812.09301 | 11609.380910 | 238961.2505 | 35321.45877 |
| 1 | Harlan Barnes | eu.dolor@diam.co.uk | Belize | 0 | 40.870623 | 66646.89292 | 9572.957136 | 530973.9078 | 45115.52566 |
| 2 | Naomi Rodriquez | vulputate.mauris.sagittis@ametconsectetueradip... | Algeria | 1 | 43.152897 | 53798.55112 | 11160.355060 | 638467.1773 | 42925.70921 |
| 3 | Jade Cunningham | malesuada@dignissim.com | Cook Islands | 1 | 58.271369 | 79370.03798 | 14426.164850 | 548599.0524 | 67422.36313 |
| 4 | Cedric Leach | felis.ullamcorper.viverra@egetmollislectus.net | Brazil | 1 | 57.313749 | 59729.15130 | 5358.712177 | 560304.0671 | 55915.46248 |
Data Visualization
sns.pairplot(df)
- The pair plot suggests a roughly linear relationship between Car Purchase Amount and each of Age, Annual Salary and Net Worth.
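One way to put a number on what the pair plot suggests is the Pearson correlation between each feature and the target. The sketch below uses a small made-up array rather than the real dataset; the toy values and the helper name `pearson_with_target` are illustrative assumptions. On the real data, `df.corr(numeric_only=True)` (in recent pandas versions) should give the same kind of summary.

```python
import numpy as np

def pearson_with_target(features, target):
    """Return the Pearson correlation of each feature column with the target."""
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    # Center both sides, then divide the cross-product by the product of the norms
    fc = features - features.mean(axis=0)
    tc = target - target.mean()
    return (fc * tc[:, None]).sum(axis=0) / (
        np.sqrt((fc ** 2).sum(axis=0)) * np.sqrt((tc ** 2).sum())
    )

# Made-up example: column 0 drives the target, column 1 is unrelated noise
rng = np.random.default_rng(0)
driver = rng.uniform(20, 70, size=200)
noise = rng.uniform(0, 1, size=200)
toy_target = 1000 * driver + rng.normal(0, 500, size=200)

corr = pearson_with_target(np.column_stack([driver, noise]), toy_target)
print(corr)
```

A value close to 1 or -1 indicates a strong linear relationship; a value near 0 indicates little or none.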
Creating training and testing datasets
– First, we build the input features. We drop Customer Name, Customer e-mail and Country because they aren't useful for our purpose, and we drop the target variable, Car Purchase Amount, as well:
X = df.drop(['Customer Name', 'Customer e-mail', 'Country', 'Car Purchase Amount'], axis=1)
X.head()
| | Gender | Age | Annual Salary | Credit Card Debt | Net Worth |
|---|---|---|---|---|---|
| 0 | 0 | 41.851720 | 62812.09301 | 11609.380910 | 238961.2505 |
| 1 | 0 | 40.870623 | 66646.89292 | 9572.957136 | 530973.9078 |
| 2 | 1 | 43.152897 | 53798.55112 | 11160.355060 | 638467.1773 |
| 3 | 1 | 58.271369 | 79370.03798 | 14426.164850 | 548599.0524 |
| 4 | 1 | 57.313749 | 59729.15130 | 5358.712177 | 560304.0671 |
– Second, we extract our target variable:
y = df['Car Purchase Amount']
y.shape
(500,)
Data normalization
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
scaler.data_max_
array([1.e+00, 7.e+01, 1.e+05, 2.e+04, 1.e+06])
scaler.data_min_
array([ 0., 20., 20000., 100., 20000.])
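`MinMaxScaler` maps each column to the range 0 to 1 using x' = (x - min) / (max - min). As a sanity check, the same arithmetic can be reproduced by hand with the `data_min_` and `data_max_` values printed above and the first customer row from `df.head()`. This is a sketch in plain NumPy, not the scaler's internal code.

```python
import numpy as np

# Column bounds as printed by scaler.data_min_ and scaler.data_max_
data_min = np.array([0.0, 20.0, 20000.0, 100.0, 20000.0])
data_max = np.array([1.0, 70.0, 100000.0, 20000.0, 1000000.0])

# First row of X: Gender, Age, Annual Salary, Credit Card Debt, Net Worth
row = np.array([0.0, 41.851720, 62812.09301, 11609.380910, 238961.2505])

# Min-max scaling, applied column by column
row_scaled = (row - data_min) / (data_max - data_min)
print(row_scaled)
```

Every entry lands between 0 and 1, so features measured in years and features measured in dollars end up on a comparable scale.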
print(X_scaled[:,0])
[0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1.]
y.shape
(500,)
y = y.values.reshape(-1, 1)
y.shape
(500, 1)
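# Use a separate scaler for the target so the scaler fitted on X is not overwritten
y_scaler = MinMaxScaler()
y_scaled = y_scaler.fit_transform(y)
y_scaled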
array([[0.37072477], [0.50866938], [0.47782689], [0.82285018], [0.66078116], [0.67059152], [0.28064374], [0.54133778], [0.54948752], [0.4111198 ], [0.70486638], [0.46885649], [0.27746526], ............ ............ ............ [0.54592485], [0.77729956], [0.56199216], [0.31678049], [0.77672238], [0.51326977], [0.50855247]])
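A model trained on `y_scaled` predicts in the same 0 to 1 units, so its outputs need to be mapped back to dollars. The sketch below inverts the min-max formula by hand. The bounds 9,000 and 80,000 are not stated in the source; they are inferred here from the first rows of `df.head()` and the matching entries of `y_scaled`, so treat them as an assumption. A fitted scikit-learn scaler's `inverse_transform` does the same job.

```python
import numpy as np

# Assumed target bounds, inferred from the printed rows (not given in the source)
Y_MIN, Y_MAX = 9000.0, 80000.0

def to_dollars(y_scaled):
    """Invert min-max scaling: y = y_scaled * (max - min) + min."""
    return np.asarray(y_scaled, dtype=float) * (Y_MAX - Y_MIN) + Y_MIN

# First three scaled targets from the output above
first_scaled = np.array([0.37072477, 0.50866938, 0.47782689])
# Matching Car Purchase Amount values from df.head()
first_dollars = np.array([35321.45877, 45115.52566, 42925.70921])

recovered = to_dollars(first_scaled)
print(recovered)
```

The recovered values agree with the original purchase amounts to within the rounding of the printed output.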
Building and training the neural network
from sklearn.model_selection import train_test_split

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.25)
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# First hidden layer: 30 neurons, fed by the 5 input features
model.add(Dense(30, input_dim=5, activation='relu'))
# Second hidden layer: 60 neurons
model.add(Dense(60, activation='relu'))
# Output layer: a single linear unit for the regression target
model.add(Dense(1, activation='linear'))
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 30)                180
_________________________________________________________________
dense_1 (Dense)              (None, 60)                1860
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 61
=================================================================
Total params: 2,101
Trainable params: 2,101
Non-trainable params: 0
_________________________________________________________________
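The parameter counts in the summary can be checked by hand: a dense layer with n inputs and m units has n × m weights plus m biases. A small sketch (the helper name `dense_params` is ours, not part of Keras):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_sizes = [5, 30, 60, 1]  # input features, then the three Dense layers
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [180, 1860, 61] 2101
```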
model.compile(optimizer='adam', loss='mean_squared_error')
epochs_hist = model.fit(X_train, y_train, epochs=100, batch_size=25, verbose=1, validation_split=0.2)
Epoch 1/100
12/12 [==============================] - 0s 28ms/step - loss: 9.0786e-04 - val_loss: 9.8518e-04
Epoch 2/100
12/12 [==============================] - 0s 26ms/step - loss: 7.9939e-04 - val_loss: 8.7099e-04
Epoch 3/100
12/12 [==============================] - 0s 25ms/step - loss: 7.3383e-04 - val_loss: 7.9677e-04
Epoch 4/100
12/12 [==============================] - 0s 24ms/step - loss: 6.5883e-04 - val_loss: 7.0038e-04
Epoch 5/100
12/12 [==============================] - 0s 25ms/step - loss: 6.1143e-04 - val_loss: 6.2026e-04
Epoch 6/100
12/12 [==============================] - 0s 21ms/step - loss: 5.2812e-04 - val_loss: 5.4793e-04
Epoch 7/100
12/12 [==============================] - 0s 27ms/step - loss: 4.9849e-04 - val_loss: 4.8312e-04
............
Epoch 97/100
12/12 [==============================] - 0s 28ms/step - loss: 7.6140e-06 - val_loss: 2.1254e-05
Epoch 98/100
12/12 [==============================] - 0s 15ms/step - loss: 7.6621e-06 - val_loss: 2.1219e-05
Epoch 99/100
12/12 [==============================] - 0s 24ms/step - loss: 9.1768e-06 - val_loss: 2.3739e-05
Epoch 100/100
12/12 [==============================] - 0s 14ms/step - loss: 8.3331e-06 - val_loss: 2.2158e-05
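The `12/12` in the log is the number of batches per epoch. It follows from the split sizes: 500 rows, 25% held out for testing, 20% of the remainder set aside for validation, and batches of 25. The sketch below redoes that arithmetic; the exact rounding inside scikit-learn and Keras may differ for sizes that don't divide evenly, so take it as an approximation of their behavior.

```python
import math

n_rows = 500
n_test = math.ceil(n_rows * 0.25)        # rows held out by train_test_split
n_train_full = n_rows - n_test           # rows passed to model.fit
n_val = round(n_train_full * 0.2)        # rows set aside by validation_split
n_fit = n_train_full - n_val             # rows used for gradient updates
steps_per_epoch = math.ceil(n_fit / 25)  # batches of 25

print(n_test, n_val, n_fit, steps_per_epoch)  # 125 75 300 12
```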
Testing the model
print(epochs_hist.history.keys())
dict_keys(['loss', 'val_loss'])
plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])  # loss on the validation_split rows
plt.title('Model Loss Progression During Training/Validation')
plt.ylabel('Training and Validation Losses')
plt.xlabel('Epoch')
plt.legend(['Training Loss', 'Validation Loss'])
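Because the loss is computed on min-max scaled targets, its raw value is hard to interpret. Taking the square root and multiplying by the target range converts it into an approximate error in dollars. The range of 71,000 used below is an assumption, inferred from the printed `df.head()` and `y_scaled` rows rather than stated in the source, and the helper name is ours.

```python
import math

# Assumed target range (max - min), inferred from the printed rows
TARGET_RANGE = 71000.0

def scaled_mse_to_rmse_dollars(mse_scaled, target_range=TARGET_RANGE):
    """Convert an MSE on min-max scaled targets to an RMSE in original units."""
    return math.sqrt(mse_scaled) * target_range

final_val_loss = 2.2158e-05  # val_loss reported at epoch 100
rmse_dollars = scaled_mse_to_rmse_dollars(final_val_loss)
print(round(rmse_dollars))
```

Under that assumption, the final validation loss corresponds to a typical error of a few hundred dollars per prediction. The held-out `X_test` and `y_test` are not used in the plot; `model.evaluate(X_test, y_test)` would report the same kind of scaled loss for them.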
Example with our model
# Gender, Age, Annual Salary, Credit Card Debt, Net Worth
X_Testing = np.array([[1, 50, 50000, 10985, 629312]])
y_predict = model.predict(X_Testing)
y_predict.shape
(1, 1)
print('Expected Purchase Amount=', y_predict[:,0])
Expected Purchase Amount= [244686.44]
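The network was trained on min-max scaled inputs and targets, so a raw row like the one above would normally be scaled before `model.predict`, and the output mapped back to dollars afterwards. The sketch below shows that round trip in plain NumPy. The input bounds are the ones printed by `scaler.data_min_` and `scaler.data_max_`; the target bounds (9,000 and 80,000) are inferred from the printed rows and are an assumption; `predict_fn` and `dummy_predict` are hypothetical stand-ins for `model.predict`.

```python
import numpy as np

# Input bounds as printed by scaler.data_min_ and scaler.data_max_
X_MIN = np.array([0.0, 20.0, 20000.0, 100.0, 20000.0])
X_MAX = np.array([1.0, 70.0, 100000.0, 20000.0, 1000000.0])
# Target bounds: assumed, not stated in the source
Y_MIN, Y_MAX = 9000.0, 80000.0

def predict_dollars(predict_fn, rows):
    """Scale raw rows, predict in scaled units, and map back to dollars."""
    rows = np.asarray(rows, dtype=float)
    rows_scaled = (rows - X_MIN) / (X_MAX - X_MIN)
    y_scaled = np.asarray(predict_fn(rows_scaled), dtype=float)
    return y_scaled * (Y_MAX - Y_MIN) + Y_MIN

# Hypothetical stand-in for model.predict: always returns 0.5 in scaled units
def dummy_predict(rows_scaled):
    return np.full((len(rows_scaled), 1), 0.5)

# Gender, Age, Annual Salary, Credit Card Debt, Net Worth
X_Testing = np.array([[1, 50, 50000, 10985, 629312]])
result = predict_dollars(dummy_predict, X_Testing)
print(result)  # 0.5 in scaled units maps to 44500.0
```

With the real model, `predict_fn` would be `model.predict`. A result far outside the range of purchase amounts in the data, such as the figure printed above, is a hint that the input was not scaled.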