Car Purchase Prediction

Suppose we work as car salespeople and need to develop a model that predicts the total dollar amount a customer is willing to pay for a car. The dataset contains the following fields:

  • Customer Name
  • Customer e-mail
  • Country
  • Gender
  • Age
  • Annual Salary
  • Credit Card Debt
  • Net Worth

Our target variable (the value we want to predict) is Car Purchase Amount.

Import libraries and dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('Car_Purchasing_Data.csv', encoding='ISO-8859-1')
df.head()
       Customer Name    Customer e-mail                                    Country       Gender  Age        Annual Salary  Credit Card Debt  Net Worth    Car Purchase Amount
    0  Martina Avila    cubilia.Curae.Phasellus@quisaccumsanconvallis.edu  Bulgaria      0       41.851720  62812.09301    11609.380910      238961.2505  35321.45877
    1  Harlan Barnes    eu.dolor@diam.co.uk                                Belize        0       40.870623  66646.89292    9572.957136       530973.9078  45115.52566
    2  Naomi Rodriquez  vulputate.mauris.sagittis@ametconsectetueradip...  Algeria       1       43.152897  53798.55112    11160.355060      638467.1773  42925.70921
    3  Jade Cunningham  malesuada@dignissim.com                            Cook Islands  1       58.271369  79370.03798    14426.164850      548599.0524  67422.36313
    4  Cedric Leach     felis.ullamcorper.viverra@egetmollislectus.net     Brazil        1       57.313749  59729.15130    5358.712177       560304.0671  55915.46248

Data Visualization

sns.pairplot(df)
  • The pair plot suggests a roughly linear relationship between Car Purchase Amount and each of Age, Annual Salary and Net Worth.
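
A pair plot gives only a visual impression, so it can help to put a number on it. The sketch below computes the Pearson correlation of each column with the target. It uses a small synthetic DataFrame (`demo`, with made-up coefficients) rather than the real data, so the values it prints say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset: the target is built as a noisy
# linear combination of three features, so strong correlations are expected.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Age": rng.uniform(20, 70, n),
    "Annual Salary": rng.uniform(20_000, 100_000, n),
    "Credit Card Debt": rng.uniform(100, 20_000, n),
    "Net Worth": rng.uniform(20_000, 1_000_000, n),
})
demo["Car Purchase Amount"] = (
    800 * demo["Age"]
    + 0.55 * demo["Annual Salary"]
    + 0.03 * demo["Net Worth"]
    + rng.normal(0, 1_000, n)
)

# Pearson correlation of every column with the target, strongest first.
corr_with_target = (
    demo.corr()["Car Purchase Amount"]
    .drop("Car Purchase Amount")
    .sort_values(ascending=False)
)
print(corr_with_target)
```

On the real `df`, the text columns (name, e-mail, country) would need to be dropped first, or excluded with `df.corr(numeric_only=True)` on recent pandas versions.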

Creating testing and training dataset

– First, we construct the feature matrix:

We drop the features that aren't useful for our purpose (name, e-mail and country). We also drop the target variable, since it must not be part of the inputs.

X = df.drop(['Customer Name', 'Customer e-mail', 'Country', 'Car Purchase Amount'], axis = 1)
X.head()
       Gender  Age        Annual Salary  Credit Card Debt  Net Worth
    0  0       41.851720  62812.09301    11609.380910      238961.2505
    1  0       40.870623  66646.89292    9572.957136       530973.9078
    2  1       43.152897  53798.55112    11160.355060      638467.1773
    3  1       58.271369  79370.03798    14426.164850      548599.0524
    4  1       57.313749  59729.15130    5358.712177       560304.0671

– Second, we extract the target variable:

y = df['Car Purchase Amount']
y.shape
    (500,)

Data normalization

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
scaler.data_max_
    array([1.e+00, 7.e+01, 1.e+05, 2.e+04, 1.e+06])
scaler.data_min_
    array([    0.,    20., 20000.,   100., 20000.])
print(X_scaled[:,0])
    [0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1.
     0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0.
     1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0.
     0. 1. 1. 0. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1.
     0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 0.
     1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1.
     0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1.
     0. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1.
     0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0.
     1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1.
     0. 1. 0. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1.
     0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0.
     0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0.
     0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0.
     0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 1. 1. 0. 1.
     1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1.
     0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1.
     0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.
     1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 0. 1. 1.
     0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0.
     0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1.]
y.shape
    (500,)
y = y.values.reshape(-1,1)
y.shape
    (500, 1)
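# Use a separate scaler for the target so the fit on X is not overwritten
scaler_y = MinMaxScaler()
y_scaled = scaler_y.fit_transform(y)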
y_scaled
    array([[0.37072477],
           [0.50866938],
           [0.47782689],
           [0.82285018],
           [0.66078116],
           [0.67059152],
           [0.28064374],
           [0.54133778],
           [0.54948752],
           [0.4111198 ],
           [0.70486638],
           [0.46885649],
           [0.27746526],
           ............
           ............
           ............
           [0.54592485],
           [0.77729956],
           [0.56199216],
           [0.31678049],
           [0.77672238],
           [0.51326977],
           [0.50855247]])
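
Min-max scaling maps each column to the range [0, 1] through (x - min) / (max - min). Calling `fit_transform` a second time on the same scaler object replaces the minima and maxima it learned before, so one scaler for the features and another for the target is a common arrangement. The sketch below uses toy arrays (`X_demo`, `y_demo`) and hypothetical names (`x_scaler`, `y_scaler`) to show both points, plus `inverse_transform` for mapping scaled values back to their original units.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features and one target.
X_demo = np.array([[20.0, 20_000.0], [45.0, 60_000.0], [70.0, 100_000.0]])
y_demo = np.array([[10_000.0], [40_000.0], [80_000.0]])

# One scaler per array, so fitting the target does not overwrite the feature fit.
x_scaler = MinMaxScaler()
y_scaler = MinMaxScaler()
X_demo_scaled = x_scaler.fit_transform(X_demo)
y_demo_scaled = y_scaler.fit_transform(y_demo)

# Min-max scaling by hand: (x - min) / (max - min), column by column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)
print(np.allclose(manual, X_demo_scaled))

# inverse_transform maps scaled values back to the original units.
y_restored = y_scaler.inverse_transform(y_demo_scaled)
print(y_restored.ravel())
```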

Building and training the neural network

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.25)  # hold out 25% of the data for testing
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
#First Layer
model.add(Dense(30, input_dim=5, activation='relu')) #30 neurons
#Second Layer
model.add(Dense(60, activation='relu')) #60 neurons
#Output Layer
model.add(Dense(1, activation='linear'))
model.summary()
    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense (Dense)                (None, 30)                180       
    _________________________________________________________________
    dense_1 (Dense)              (None, 60)                1860      
    _________________________________________________________________
    dense_2 (Dense)              (None, 1)                 61        
    =================================================================
    Total params: 2,101
    Trainable params: 2,101
    Non-trainable params: 0
    _________________________________________________________________
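
The parameter counts in the summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_param_count` below is a hypothetical name used only for this check.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 5 input features, then the units of each Dense layer in order.
layer_sizes = [5, 30, 60, 1]
per_layer = [dense_param_count(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer)       # [180, 1860, 61]
print(sum(per_layer))  # 2101
```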
model.compile(optimizer='adam', loss='mean_squared_error')

epochs_hist = model.fit(X_train, y_train, epochs=100, batch_size=25,  verbose=1, validation_split=0.2)
    Epoch 1/100
    12/12 [==============================] - 0s 28ms/step - loss: 9.0786e-04 - val_loss: 9.8518e-04
    Epoch 2/100
    12/12 [==============================] - 0s 26ms/step - loss: 7.9939e-04 - val_loss: 8.7099e-04
    Epoch 3/100
    12/12 [==============================] - 0s 25ms/step - loss: 7.3383e-04 - val_loss: 7.9677e-04
    Epoch 4/100
    12/12 [==============================] - 0s 24ms/step - loss: 6.5883e-04 - val_loss: 7.0038e-04
    Epoch 5/100
    12/12 [==============================] - 0s 25ms/step - loss: 6.1143e-04 - val_loss: 6.2026e-04
    Epoch 6/100
    12/12 [==============================] - 0s 21ms/step - loss: 5.2812e-04 - val_loss: 5.4793e-04
    Epoch 7/100
    12/12 [==============================] - 0s 27ms/step - loss: 4.9849e-04 - val_loss: 4.8312e-04
    ...............................................................................................
    ...............................................................................................
    ...............................................................................................
    Epoch 97/100
    12/12 [==============================] - 0s 28ms/step - loss: 7.6140e-06 - val_loss: 2.1254e-05
    Epoch 98/100
    12/12 [==============================] - 0s 15ms/step - loss: 7.6621e-06 - val_loss: 2.1219e-05
    Epoch 99/100
    12/12 [==============================] - 0s 24ms/step - loss: 9.1768e-06 - val_loss: 2.3739e-05
    Epoch 100/100
    12/12 [==============================] - 0s 14ms/step - loss: 8.3331e-06 - val_loss: 2.2158e-05
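
The `12/12` in the log is the number of batches per epoch, and it follows from the split sizes. The arithmetic below is a sketch: the rounding conventions (ceiling for the test split, floor for the validation split) are assumptions about how the libraries round, although for these numbers every division is exact.

```python
import math

n_samples = 500                             # rows in the dataset (from y.shape)
n_test = math.ceil(n_samples * 25 / 100)    # test_size=0.25
n_train = n_samples - n_test                # rows passed to model.fit
n_val = n_train * 20 // 100                 # validation_split=0.2
n_fit = n_train - n_val                     # rows actually used for weight updates
steps_per_epoch = math.ceil(n_fit / 25)     # batch_size=25

print(n_test, n_train, n_val, n_fit, steps_per_epoch)  # 125 375 75 300 12
```

So only 300 of the 500 rows drive the weight updates; 75 are used for the validation loss and 125 are held out as the test set.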

Testing the model

print(epochs_hist.history.keys())
    dict_keys(['loss', 'val_loss'])
plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])

plt.title('Model Loss Progression During Training/Validation')
plt.ylabel('Training and Validation Losses')
plt.xlabel('Epoch')
plt.legend(['Training Loss', 'Validation Loss'])
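
The curves above come from the training history, where `val_loss` is computed on the part of the training data set aside by `validation_split`. The held-out `X_test` / `y_test` split can serve as a separate check. Because the target was min-max scaled, an error measured in scaled units can be converted to dollars by multiplying residuals by (max - min) of the target. The function `rmse_in_original_units` below is a hypothetical helper, and the numbers are invented for illustration.

```python
import numpy as np

def rmse_in_original_units(y_true_scaled, y_pred_scaled, y_min, y_max):
    """RMSE after undoing min-max scaling: residuals grow by (y_max - y_min)."""
    diff = np.asarray(y_pred_scaled, dtype=float) - np.asarray(y_true_scaled, dtype=float)
    residuals = diff * (y_max - y_min)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Hypothetical scaled targets and predictions, for illustration only.
y_true_scaled = np.array([[0.20], [0.50], [0.90]])
y_pred_scaled = np.array([[0.21], [0.49], [0.90]])

rmse = rmse_in_original_units(y_true_scaled, y_pred_scaled, y_min=10_000.0, y_max=80_000.0)
print(round(rmse, 2))
```

With the notebook's objects, the inputs would presumably be `y_test` and `model.predict(X_test)`, with `y_min` and `y_max` taken from `data_min_` and `data_max_` of whichever scaler was fitted on the target.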

Example with our model

# Gender, Age, Annual Salary, Credit Card Debt, Net Worth
X_Testing = np.array([[1, 50, 50000, 10985, 629312]])

y_predict = model.predict(X_Testing)
y_predict.shape
    (1, 1)
print('Expected Purchase Amount=', y_predict[:,0])
    Expected Purchase Amount= [244686.44]
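
Since the network was fitted on min-max-scaled inputs and targets, feeding it raw values and reading the output as dollars is unlikely to be meaningful. A prediction for a new customer would normally scale the features with the scaler fitted on X, then map the network output back with the scaler fitted on y. The sketch below shows that flow under stated assumptions: `predict_amount` is a hypothetical helper, `MeanModel` is a stand-in for the trained network (any object with a `predict` method), and the scaler ranges are toy values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def predict_amount(model, x_scaler, y_scaler, raw_features):
    """Scale raw features, predict in scaled space, return the result in original units."""
    scaled_input = x_scaler.transform(np.asarray(raw_features, dtype=float))
    scaled_output = np.asarray(model.predict(scaled_input)).reshape(-1, 1)
    return y_scaler.inverse_transform(scaled_output)

class MeanModel:
    """Hypothetical stand-in for the trained network: returns the mean scaled feature."""
    def predict(self, X):
        return X.mean(axis=1, keepdims=True)

# Toy ranges (hypothetical), fitted with separate scalers for features and target.
x_scaler = MinMaxScaler().fit(np.array([[0, 20, 20_000, 100, 20_000],
                                        [1, 70, 100_000, 20_000, 1_000_000]], dtype=float))
y_scaler = MinMaxScaler().fit(np.array([[10_000.0], [80_000.0]]))

# Gender, Age, Annual Salary, Credit Card Debt, Net Worth
customer = [[1, 50, 50_000, 10_985, 629_312]]
amount = predict_amount(MeanModel(), x_scaler, y_scaler, customer)
print(amount.shape)
```

With the real model, `MeanModel()` would be replaced by the trained Keras `model`, and the two scalers by the ones fitted on `X` and `y` during normalization.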