Car Purchase Prediction

Suppose we work as a car salesman and need to develop a model that predicts the total dollar amount a customer is willing to pay. The dataset has the following columns:

Customer Name
Customer e-mail
Country
Gender
Age
Annual Salary
Credit Card Debt
Net Worth

Our target variable (the value we want to predict) is Car Purchase Amount.

Import libraries and dataset

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('Car_Purchasing_Data.csv', encoding='ISO-8859-1')
df.head()

	Customer Name	Customer e-mail	Country	Gender	Age	Annual Salary	Credit Card Debt	Net Worth	Car Purchase Amount
0	Martina Avila	cubilia.Curae.Phasellus@quisaccumsanconvallis.edu	Bulgaria	0	41.851720	62812.09301	11609.380910	238961.2505	35321.45877
1	Harlan Barnes	eu.dolor@diam.co.uk	Belize	0	40.870623	66646.89292	9572.957136	530973.9078	45115.52566
2	Naomi Rodriquez	vulputate.mauris.sagittis@ametconsectetueradip...	Algeria	1	43.152897	53798.55112	11160.355060	638467.1773	42925.70921
3	Jade Cunningham	malesuada@dignissim.com	Cook Islands	1	58.271369	79370.03798	14426.164850	548599.0524	67422.36313
4	Cedric Leach	felis.ullamcorper.viverra@egetmollislectus.net	Brazil	1	57.313749	59729.15130	5358.712177	560304.0671	55915.46248

Data Visualization

sns.pairplot(df)

The pair plot suggests a roughly linear relationship between Car Purchase Amount and Age, Annual Salary, and Net Worth.
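A visual impression like this can be checked numerically with Pearson correlation coefficients. The sketch below uses synthetic stand-in data rather than the real CSV: the column ranges loosely follow the dataset, but the coefficients used to build `amount` are made up for illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the real dataset). The amount is built as a
# noisy linear mix of age, salary and net worth; debt is deliberately left out.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(20, 70, n)
salary = rng.uniform(20_000, 100_000, n)
debt = rng.uniform(100, 20_000, n)
net_worth = rng.uniform(20_000, 1_000_000, n)
amount = 800 * age + 0.5 * salary + 0.03 * net_worth + rng.normal(0, 1_000, n)

features = {
    "Age": age,
    "Annual Salary": salary,
    "Credit Card Debt": debt,
    "Net Worth": net_worth,
}

# Pearson correlation of each feature with the target
correlations = {
    name: float(np.corrcoef(values, amount)[0, 1])
    for name, values in features.items()
}
for name, r in correlations.items():
    print(f"{name:>16}: r = {r:+.2f}")
```

On the real data, the same idea would apply to the numeric columns of `df`. Features with a coefficient close to zero are the ones the pair plot shows as shapeless clouds.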

Creating the training and testing datasets

– First, we construct the feature matrix:

We drop the features that are not useful for our purpose (customer name, e-mail, and country), along with the target variable itself.

X = df.drop(['Customer Name', 'Customer e-mail', 'Country', 'Car Purchase Amount'], axis = 1)
X.head()

	Gender	Age	Annual Salary	Credit Card Debt	Net Worth
0	0	41.851720	62812.09301	11609.380910	238961.2505
1	0	40.870623	66646.89292	9572.957136	530973.9078
2	1	43.152897	53798.55112	11160.355060	638467.1773
3	1	58.271369	79370.03798	14426.164850	548599.0524
4	1	57.313749	59729.15130	5358.712177	560304.0671

– Second, we extract the target variable:

y = df['Car Purchase Amount']
y.shape

    (500,)

Data normalization

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

scaler.data_max_

    array([1.e+00, 7.e+01, 1.e+05, 2.e+04, 1.e+06])

scaler.data_min_

    array([    0.,    20., 20000.,   100., 20000.])
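To make the scaling concrete, here is a small sketch of the min-max formula, `(x - min) / (max - min)`, applied by hand to the first row of `X`. It assumes the printed `data_min_` and `data_max_` values above are exact rather than rounded for display.

```python
import numpy as np

# Per-column minima and maxima as printed by scaler.data_min_ / scaler.data_max_
data_min = np.array([0.0, 20.0, 20_000.0, 100.0, 20_000.0])
data_max = np.array([1.0, 70.0, 100_000.0, 20_000.0, 1_000_000.0])

# First row of X: Gender, Age, Annual Salary, Credit Card Debt, Net Worth
row = np.array([0, 41.851720, 62812.09301, 11609.380910, 238961.2505])

# Min-max scaling maps each column onto the interval [0, 1]
row_scaled = (row - data_min) / (data_max - data_min)
print(np.round(row_scaled, 4))
```

For example, the age of 41.85 lies a little under halfway between 20 and 70, so it maps to roughly 0.44.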

print(X_scaled[:,0])

    [0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1. 1.
     0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 1. 0.
     1. 0. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 0.
     0. 1. 1. 0. 1. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1.
     0. 1. 0. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 0. 0. 0.
     1. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1.
     0. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 1.
     0. 1. 1. 1. 0. 1. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1.
     0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0.
     1. 1. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1.
     0. 1. 0. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0. 1. 1.
     0. 0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 0.
     0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 1. 0.
     0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0.
     0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 1. 1. 0. 1.
     1. 0. 1. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 1.
     0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1.
     0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 1. 0. 0. 0. 0. 0.
     1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 0. 1. 1.
     0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0.
     0. 1. 1. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1.]

y.shape

    (500,)

y = y.values.reshape(-1,1)
y.shape

    (500, 1)

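# Use a separate scaler for the target so that `scaler` stays fitted on X
scaler_y = MinMaxScaler()
y_scaled = scaler_y.fit_transform(y)
y_scaled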

    array([[0.37072477],
           [0.50866938],
           [0.47782689],
           [0.82285018],
           [0.66078116],
           [0.67059152],
           [0.28064374],
           [0.54133778],
           [0.54948752],
           [0.4111198 ],
           [0.70486638],
           [0.46885649],
           [0.27746526],
           ............
           ............
           ............
           [0.54592485],
           [0.77729956],
           [0.56199216],
           [0.31678049],
           [0.77672238],
           [0.51326977],
           [0.50855247]])

Building and training the neural network

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_scaled, test_size=0.25)  # hold out 25% of the data for testing

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
#First Layer
model.add(Dense(30, input_dim=5, activation='relu')) #30 neurons
#Second Layer
model.add(Dense(60, activation='relu')) #60 neurons
#Output Layer
model.add(Dense(1, activation='linear'))
model.summary()

    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense (Dense)                (None, 30)                180       
    _________________________________________________________________
    dense_1 (Dense)              (None, 60)                1860      
    _________________________________________________________________
    dense_2 (Dense)              (None, 1)                 61        
    =================================================================
    Total params: 2,101
    Trainable params: 2,101
    Non-trainable params: 0
    _________________________________________________________________
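The parameter counts in the summary can be reproduced by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` below is only illustrative.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input features, then the number of units in each Dense layer
layer_sizes = [5, 30, 60, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]

print(per_layer, sum(per_layer))  # [180, 1860, 61] 2101
```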

model.compile(optimizer='adam', loss='mean_squared_error')
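As a reminder of what the `mean_squared_error` loss measures, the following sketch computes it by hand with NumPy. The target and prediction values are made up for illustration and do not come from the model.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [0.37, 0.51, 0.48]  # made-up scaled targets
y_pred = [0.35, 0.50, 0.52]  # made-up scaled predictions

mse = mean_squared_error(y_true, y_pred)
print(mse)
```

Because the targets here are scaled to [0, 1], the loss values are small in absolute terms; they describe errors relative to the target's range, not dollars.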

epochs_hist = model.fit(X_train, y_train, epochs=100, batch_size=25,  verbose=1, validation_split=0.2)

    Epoch 1/100
    12/12 [==============================] - 0s 28ms/step - loss: 9.0786e-04 - val_loss: 9.8518e-04
    Epoch 2/100
    12/12 [==============================] - 0s 26ms/step - loss: 7.9939e-04 - val_loss: 8.7099e-04
    Epoch 3/100
    12/12 [==============================] - 0s 25ms/step - loss: 7.3383e-04 - val_loss: 7.9677e-04
    Epoch 4/100
    12/12 [==============================] - 0s 24ms/step - loss: 6.5883e-04 - val_loss: 7.0038e-04
    Epoch 5/100
    12/12 [==============================] - 0s 25ms/step - loss: 6.1143e-04 - val_loss: 6.2026e-04
    Epoch 6/100
    12/12 [==============================] - 0s 21ms/step - loss: 5.2812e-04 - val_loss: 5.4793e-04
    Epoch 7/100
    12/12 [==============================] - 0s 27ms/step - loss: 4.9849e-04 - val_loss: 4.8312e-04
    ...............................................................................................
    ...............................................................................................
    ...............................................................................................
    Epoch 97/100
    12/12 [==============================] - 0s 28ms/step - loss: 7.6140e-06 - val_loss: 2.1254e-05
    Epoch 98/100
    12/12 [==============================] - 0s 15ms/step - loss: 7.6621e-06 - val_loss: 2.1219e-05
    Epoch 99/100
    12/12 [==============================] - 0s 24ms/step - loss: 9.1768e-06 - val_loss: 2.3739e-05
    Epoch 100/100
    12/12 [==============================] - 0s 14ms/step - loss: 8.3331e-06 - val_loss: 2.2158e-05
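The `12/12` in the log follows from the split sizes. The arithmetic below is a sketch that assumes `train_test_split` holds out 25% of the 500 rows and that Keras then reserves 20% of the remaining rows for validation; with these numbers no rounding is involved.

```python
import math

n_rows = 500          # rows in the dataset (y.shape)
test_size = 0.25      # train_test_split(test_size=0.25)
validation_pct = 20   # validation_split=0.2, written as a percentage
batch_size = 25

n_test = math.ceil(n_rows * test_size)            # rows held out for testing
n_train_full = n_rows - n_test                    # rows passed to model.fit
n_val = n_train_full * validation_pct // 100      # rows used for val_loss
n_fit = n_train_full - n_val                      # rows used for gradient updates
steps_per_epoch = math.ceil(n_fit / batch_size)   # batches per epoch

print(n_test, n_train_full, n_val, n_fit, steps_per_epoch)  # 125 375 75 300 12
```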

Testing the model

print(epochs_hist.history.keys())

    dict_keys(['loss', 'val_loss'])

plt.plot(epochs_hist.history['loss'])
plt.plot(epochs_hist.history['val_loss'])

plt.title('Model Loss Progression During Training/Validation')
plt.ylabel('Training and Validation Losses')
plt.xlabel('Epoch')
plt.legend(['Training Loss', 'Validation Loss'])  # val_loss comes from validation_split, not from X_test
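One way to read the final loss values is to take their square root, which gives a root-mean-squared error in the same units as the scaled target. The sketch below uses the epoch-100 numbers from the training log; since the target was scaled to [0, 1], the result can be read as a fraction of the target's range.

```python
import math

# Final-epoch values copied from the training log
final_loss = 8.3331e-06
final_val_loss = 2.2158e-05

rmse_train = math.sqrt(final_loss)
rmse_val = math.sqrt(final_val_loss)

print(f"train RMSE: {rmse_train:.4f} of the target range")
print(f"val   RMSE: {rmse_val:.4f} of the target range")
```

Converting these figures to dollars would require multiplying by the range of Car Purchase Amount (max minus min), which is not shown in the text.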

Example with our model

# Feature order: Gender, Age, Annual Salary, Credit Card Debt, Net Worth
# Note: these are raw values, while the model was trained on min-max scaled inputs
X_Testing = np.array([[1, 50, 50000, 10985, 629312]])

y_predict = model.predict(X_Testing)
y_predict.shape

    (1, 1)

print('Expected Purchase Amount=', y_predict[:,0])

    Expected Purchase Amount= [244686.44]
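For a prediction in dollars, the input row would normally be scaled with the same feature ranges used in training, and the network's output mapped back to the original target scale. The sketch below does both by hand with NumPy. The feature ranges are the ones printed earlier; the target range (`y_min`, `y_max`) and the scaled prediction are hypothetical placeholders, because neither is shown in the text.

```python
import numpy as np

# Feature ranges as printed by scaler.data_min_ / scaler.data_max_
x_min = np.array([0.0, 20.0, 20_000.0, 100.0, 20_000.0])
x_max = np.array([1.0, 70.0, 100_000.0, 20_000.0, 1_000_000.0])

# HYPOTHETICAL target range: the real min/max of Car Purchase Amount are not given
y_min, y_max = 9_000.0, 80_000.0

def scale_features(x_raw):
    """Apply min-max scaling with the training feature ranges."""
    return (np.asarray(x_raw, dtype=float) - x_min) / (x_max - x_min)

def unscale_target(y_scaled):
    """Invert min-max scaling to get back to dollars."""
    return np.asarray(y_scaled, dtype=float) * (y_max - y_min) + y_min

x_raw = np.array([[1, 50, 50000, 10985, 629312]])
x_scaled = scale_features(x_raw)

# Stand-in for model.predict(x_scaled): a made-up scaled prediction
y_scaled_pred = np.array([[0.5]])
y_dollars = unscale_target(y_scaled_pred)

print(x_scaled)
print(y_dollars)
```

With scikit-learn, the same two steps correspond to calling `transform` on the scaler fitted to `X` and `inverse_transform` on a separate scaler fitted to `y`.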