Segmentation to define a marketing strategy

[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]

We have a large dataset from a bank. It summarizes the usage behavior of about 9,000 active credit card holders over the last 6 months.

The marketing team wants to launch an advertising campaign by segmenting its customers into 3 different groups.

Data Dictionary for Credit Card dataset:

  • CUST_ID: Identification of the credit card holder (categorical)
  • BALANCE: Balance amount left in the account to make purchases
  • BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
  • PURCHASES: Amount of purchases made from the account
  • ONEOFF_PURCHASES: Maximum purchase amount done in one go
  • INSTALLMENTS_PURCHASES: Amount of purchases done in installments
  • CASH_ADVANCE: Cash in advance given by the user
  • PURCHASES_FREQUENCY: How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
  • ONEOFF_PURCHASES_FREQUENCY: How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
  • PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
  • CASH_ADVANCE_FREQUENCY: How frequently the cash advance is being paid
  • CASH_ADVANCE_TRX: Number of transactions made with cash advance
  • PURCHASES_TRX: Number of purchase transactions made
  • CREDIT_LIMIT: Credit card limit for the user
  • PAYMENTS: Amount of payments made by the user
  • MINIMUM_PAYMENTS: Minimum amount of payments made by the user
  • PRC_FULL_PAYMENT: Percentage of full payment paid by the user
  • TENURE: Tenure of the credit card service for the user

Data Source: Credit Card Dataset

1 – Import libraries and dataset

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
df = pd.read_csv("Marketing_data.csv")
df
[Output: the dataframe — 8950 rows × 18 columns (first and last five rows displayed)]
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   CUST_ID                           8950 non-null   object 
 1   BALANCE                           8950 non-null   float64
 2   BALANCE_FREQUENCY                 8950 non-null   float64
 3   PURCHASES                         8950 non-null   float64
 4   ONEOFF_PURCHASES                  8950 non-null   float64
 5   INSTALLMENTS_PURCHASES            8950 non-null   float64
 6   CASH_ADVANCE                      8950 non-null   float64
 7   PURCHASES_FREQUENCY               8950 non-null   float64
 8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
 9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
 10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
 11  CASH_ADVANCE_TRX                  8950 non-null   int64  
 12  PURCHASES_TRX                     8950 non-null   int64  
 13  CREDIT_LIMIT                      8949 non-null   float64
 14  PAYMENTS                          8950 non-null   float64
 15  MINIMUM_PAYMENTS                  8637 non-null   float64
 16  PRC_FULL_PAYMENT                  8950 non-null   float64
 17  TENURE                            8950 non-null   int64  
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
df.describe()
(output of df.describe(), transposed and rounded to two decimals for readability)

                                  count     mean      std    min      25%      50%      75%       max
BALANCE                            8950  1564.47  2081.53   0.00   128.28   873.39  2054.14  19043.14
BALANCE_FREQUENCY                  8950     0.88     0.24   0.00     0.89     1.00     1.00      1.00
PURCHASES                          8950  1003.20  2136.63   0.00    39.64   361.28  1110.13  49039.57
ONEOFF_PURCHASES                   8950   592.44  1659.89   0.00     0.00    38.00   577.41  40761.25
INSTALLMENTS_PURCHASES             8950   411.07   904.34   0.00     0.00    89.00   468.64  22500.00
CASH_ADVANCE                       8950   978.87  2097.16   0.00     0.00     0.00  1113.82  47137.21
PURCHASES_FREQUENCY                8950     0.49     0.40   0.00     0.08     0.50     0.92      1.00
ONEOFF_PURCHASES_FREQUENCY         8950     0.20     0.30   0.00     0.00     0.08     0.30      1.00
PURCHASES_INSTALLMENTS_FREQUENCY   8950     0.36     0.40   0.00     0.00     0.17     0.75      1.00
CASH_ADVANCE_FREQUENCY             8950     0.14     0.20   0.00     0.00     0.00     0.22      1.50
CASH_ADVANCE_TRX                   8950     3.25     6.82   0.00     0.00     0.00     4.00    123.00
PURCHASES_TRX                      8950    14.71    24.86   0.00     1.00     7.00    17.00    358.00
CREDIT_LIMIT                       8949  4494.45  3638.82  50.00  1600.00  3000.00  6500.00  30000.00
PAYMENTS                           8950  1733.14  2895.06   0.00   383.28   856.90  1901.13  50721.48
MINIMUM_PAYMENTS                   8637   864.21  2372.45   0.02   169.12   312.34   825.49  76406.21
PRC_FULL_PAYMENT                   8950     0.15     0.29   0.00     0.00     0.00     0.14      1.00
TENURE                             8950    11.52     1.34   6.00    12.00    12.00    12.00     12.00
  • The average balance is $1,564.
  • The balance is updated frequently: the average BALANCE_FREQUENCY is ~0.9.
  • The average purchase amount is about $1,000.
  • The average maximum one-off purchase amount is ~$600.
  • The average purchase frequency is ~0.5.
  • On the whole, the averages of ONEOFF_PURCHASES_FREQUENCY, PURCHASES_INSTALLMENTS_FREQUENCY and CASH_ADVANCE_FREQUENCY are low.
  • The average credit limit is ~$4,500.
  • The average percentage of full payment paid by users is ~15%.
  • The average tenure of the credit card service is ~11.5 years.

Let’s find out who made the $40,761.25 one-off purchase:

df[df["ONEOFF_PURCHASES"] == 40761.25]
(row 550, transposed for readability)

CUST_ID                              C10574
BALANCE                            11547.52
BALANCE_FREQUENCY                      1.00
PURCHASES                          49039.57
ONEOFF_PURCHASES                   40761.25
INSTALLMENTS_PURCHASES              8278.32
CASH_ADVANCE                         558.17
PURCHASES_FREQUENCY                    1.00
ONEOFF_PURCHASES_FREQUENCY             1.00
PURCHASES_INSTALLMENTS_FREQUENCY       0.92
CASH_ADVANCE_FREQUENCY                 0.08
CASH_ADVANCE_TRX                          1
PURCHASES_TRX                           101
CREDIT_LIMIT                        22500.0
PAYMENTS                           46930.60
MINIMUM_PAYMENTS                    2974.07
PRC_FULL_PAYMENT                       0.25
TENURE                                   12

This client made 101 purchase transactions, has a credit limit of $22,500, and paid the full balance 25% of the time.
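Rather than hard-coding the maximum value, the same row can be located programmatically. A minimal sketch (not part of the original notebook) using idxmax on the dataframe loaded above:

# Locate the customer with the largest single one-off purchase
top_oneoff = df.loc[df["ONEOFF_PURCHASES"].idxmax()]
print(top_oneoff["CUST_ID"], top_oneoff["ONEOFF_PURCHASES"])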

2 – Dataset Visualization

2.1 – First, we check whether there are missing values:

sns.heatmap(df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
df.isnull().sum()
CUST_ID                               0
BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64

– We fill the missing values of ‘MINIMUM_PAYMENTS’ and ‘CREDIT_LIMIT’ with their respective means:

df.loc[(df['MINIMUM_PAYMENTS'].isnull() == True), 'MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].mean()

df.loc[(df['CREDIT_LIMIT'].isnull() == True), 'CREDIT_LIMIT'] = df['CREDIT_LIMIT'].mean()
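An equivalent, slightly more idiomatic imputation uses fillna; a sketch of the alternative (it produces the same result as the two lines above):

# Fill each column's missing values with that column's mean
df['MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].fillna(df['MINIMUM_PAYMENTS'].mean())
df['CREDIT_LIMIT'] = df['CREDIT_LIMIT'].fillna(df['CREDIT_LIMIT'].mean())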

2.2 – Second, we check whether there are duplicate rows in our data:

df.duplicated().sum()
0
sns.heatmap(df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")

– We can drop CUST_ID because it isn’t useful for clustering:

df.drop("CUST_ID", axis = 1, inplace=True)
df.head()
[Output: df.head() — the first five rows, now with 17 columns (CUST_ID removed and missing values imputed)]
n = len(df.columns)
n
17
df.columns
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES',
       'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY',
       'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY',
       'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX',
       'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT',
       'TENURE'],
      dtype='object')

The Kernel Density Estimate:

plt.figure(figsize = (10, 50))
for i in range(n):
    plt.subplot(n, 1, i+1)
    sns.distplot(df[df.columns[i]], kde_kws = {"color": "b", "lw": 3, "label": "KDE", 'bw': 0.2}, hist_kws={"color": "g"})
    plt.title(df.columns[i])

plt.tight_layout()

Correlation between variables:

correlations = df.corr()

f, ax = plt.subplots(figsize = (20,20))
sns.heatmap(correlations, annot = True)
  • There is a strong correlation between PURCHASES and both ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES.
  • We can also see a relationship between PURCHASES and both CREDIT_LIMIT and PAYMENTS.
  • There is a high correlation between PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY as well (these pairs can be confirmed numerically with the sketch below).
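To back these observations with numbers, the strongest pairwise correlations can be listed from the correlations matrix computed above; a minimal sketch (not part of the original notebook):

# Rank absolute pairwise correlations (each pair appears twice, as (A, B) and (B, A))
corr_pairs = correlations.abs().unstack().sort_values(ascending=False)
corr_pairs = corr_pairs[corr_pairs < 1.0]  # drop the diagonal (self-correlations)
print(corr_pairs.head(10))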

3 – Finding the optimal number of clusters using ‘The Elbow Method’

  • ‘The Elbow Method’ consists of plotting the explained variation as a function of the number of clusters, and picking the elbow of the curve as the number of clusters to use. This method has been designed to help to find the appropriate number of clusters in our dataset.
  • If the line graph looks like an "arm", then the "elbow" on the arm is the value of k. This k will be our number of clusters.


First, let’s scale the dataset so that no single variable stands out because of its units.

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

df_scaled.shape
(8950, 17)
df_scaled
array([[-0.73198937, -0.24943448, -0.42489974, ..., -0.31096755,
        -0.52555097,  0.36067954],
       [ 0.78696085,  0.13432467, -0.46955188, ...,  0.08931021,
         0.2342269 ,  0.36067954],
       [ 0.44713513,  0.51808382, -0.10766823, ..., -0.10166318,
        -0.52555097,  0.36067954],
       ...,
       [-0.7403981 , -0.18547673, -0.40196519, ..., -0.33546549,
         0.32919999, -4.12276757],
       [-0.74517423, -0.18547673, -0.46955188, ..., -0.34690648,
         0.32919999, -4.12276757],
       [-0.57257511, -0.88903307,  0.04214581, ..., -0.33294642,
        -0.52555097, -4.12276757]])

– We apply the elbow method:

scores_1 = []

range_values = range(1, 20)

for i in range_values:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df_scaled)
    scores_1.append(kmeans.inertia_) #WCSS


plt.plot(range_values, scores_1, 'bx-')
plt.title("Encontrar el número óptimo de Clusters")
plt.xlabel("Clusters")
plt.ylabel("WCSS(k)")
plt.show()
  • The curve starts to flatten out (our "elbow") around 4 clusters.
  • However, the decrease in WCSS does not settle into a roughly linear trend until about the 8th cluster, so we choose k = 8 (a silhouette cross-check follows below).
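As a cross-check on the elbow (an addition, not part of the original analysis), the silhouette score can be compared for a few candidate values of k on df_scaled; a subsample keeps the computation light:

from sklearn.metrics import silhouette_score

# Higher silhouette = more cohesive, better-separated clusters
for k in [3, 4, 6, 8]:
    km = KMeans(n_clusters=k, random_state=0)
    labels_k = km.fit_predict(df_scaled)
    print(k, silhouette_score(df_scaled, labels_k, sample_size=2000, random_state=0))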

4 – K-Means Method

kmeans = KMeans(8)
kmeans.fit(df_scaled)
labels = kmeans.labels_
labels 
array([1, 3, 6, ..., 7, 7, 7], dtype=int32)

This array gives the cluster number assigned to each observation.

kmeans.cluster_centers_.shape  # we have 8 centroids with the 17 variables (features)
(8, 17)

We have 8 centroids, each described by the 17 features.

cluster_centers = pd.DataFrame(data = kmeans.cluster_centers_, columns = df.columns)
cluster_centers
[Output: cluster_centers — 8 rows (one per cluster) × 17 columns, with the centroid values in standardized units]

The inverse of the scaling transformation is applied to interpret these values in their original units.

cluster_centers = scaler.inverse_transform(cluster_centers)
cluster_centers = pd.DataFrame(data = cluster_centers, columns = df.columns)
cluster_centers
[Output: cluster_centers after inverse_transform — 8 rows (one per cluster) × 17 columns, with the centroid values back in the original units]
  • First cluster of clients (Transactors): clients who pay the least in interest charges and are careful with their money: the lowest balance (~$104) and cash advance (~$303), with full payment around 23% of the time.
  • Second cluster of clients (Revolvers): they use the credit card as a loan (the most lucrative segment): high balance (~$5,000) and cash advance (~$5,000), low purchase frequency, high cash advance frequency (0.5), many cash advance transactions (~16) and a low full-payment percentage (3%).
  • Third cluster of clients (VIP/Prime): high credit limit (~$16K) and the highest percentage of full payment; the goal here would be to increase their credit limit and encourage spending.
  • Fourth cluster of clients (low tenure): clients with low tenure (~7 years) and low balance.
labels.shape
(8950,)
labels.min()
0
labels.max()
7

– We can also make predictions:

y_kmeans = kmeans.fit_predict(df_scaled)
y_kmeans
array([0, 2, 5, ..., 6, 6, 6], dtype=int32)

Let’s concatenate the cluster labels with the original dataset, so we can see which cluster each observation belongs to:

df_cluster = pd.concat([df, pd.DataFrame({'cluster': labels})], axis = 1)
df_cluster.head()
[Output: df_cluster.head() — the first five rows of the original features with the new 'cluster' column appended]
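Before plotting, a quick sanity check on cluster sizes and per-cluster averages; a sketch using the df_cluster dataframe just built:

# Number of customers per cluster and their average profile
print(df_cluster['cluster'].value_counts().sort_index())
print(df_cluster.groupby('cluster').mean().round(2))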

– We visualize histograms for each cluster:

for i in df.columns:
    plt.figure(figsize=(35, 5))
    for j in range(8):
        plt.subplot(1, 8, j+1)
        cluster = df_cluster[df_cluster['cluster'] == j]
        cluster[i].hist(bins = 20)
        plt.title('{}    \nCluster {}'.format(i, j))
    plt.show()

5 – Principal Component Analysis (PCA)

pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(df_scaled)
principal_comp
array([[-1.68221922, -1.07645126],
       [-1.13829358,  2.50647075],
       [ 0.96968292, -0.38350887],
       ...,
       [-0.92620252, -1.81078537],
       [-2.33655126, -0.65796865],
       [-0.55642498, -0.40046541]])
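How much of the total variance do these two components retain? A quick check (a sketch, using the fitted pca object):

# Fraction of variance explained by each principal component, and their sum
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())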

We’re going to create a dataframe with the two components:

pca_df = pd.DataFrame(data = principal_comp, columns=["pca1", "pca2"])
pca_df.head()
       pca1      pca2
0 -1.682219 -1.076451
1 -1.138294  2.506471
2  0.969683 -0.383509
3 -0.873627  0.043164
4 -1.599433 -0.688581

Let’s concatenate the cluster labels with the principal components dataframe:

pca_df = pd.concat([pca_df, pd.DataFrame({'cluster':labels})], axis = 1)
pca_df.head()
       pca1      pca2  cluster
0 -1.682219 -1.076451        1
1 -1.138294  2.506471        3
2  0.969683 -0.383509        6
3 -0.873627  0.043164        1
4 -1.599433 -0.688581        1
plt.figure(figsize=(10,10))
ax = sns.scatterplot(x = "pca1", y = "pca2", hue = "cluster", data = pca_df,
                     palette = ["red", "green", "blue", "pink", "yellow", "gray", "purple", "black"])
plt.show()
  • PCA has reduced our 17 dimensions to only 2 (keep in mind that K-means was applied before PCA, which is why some groups overlap in this projection).
  • The pink cluster is much more scattered than the rest, while the yellow points spread mainly along pca2 and the blue ones are grouped along pca1.
  • The remaining clusters stay close to 0, which means their values are similar (the per-cluster spread is quantified in the sketch below).
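The visual impression of spread can be quantified; a minimal sketch using pca_df:

# Per-cluster standard deviation along pca1 and pca2 (larger = more scattered)
print(pca_df.groupby('cluster')[['pca1', 'pca2']].std())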

6 – Autoencoders: a neural network to reduce the dimensionality of our dataset

from tensorflow.keras.layers import Input, Dense, AveragePooling2D, MaxPooling2D, Dropout
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.initializers import glorot_uniform
from tensorflow.keras.optimizers import SGD  # stochastic gradient descent (imported but not used; we compile with Adam)

# Size of the first encoding layer (the bottleneck below uses 10 neurons).
encoding_dim = 7

input_df = Input(shape = (17, ))

Let’s create the layers:

# ReLU activation function (Rectified Linear Unit)
x = Dense(encoding_dim, activation = 'relu')(input_df)

# 3 hidden layers.
# Glorot uniform initializer (Xavier initializer): draws the initial weights from a uniform distribution scaled by the layer's fan-in and fan-out
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x) 
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x) 
x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(x) 

# Bottleneck: we finally encode everything into only 10 neurons (the output of the encoder)
encoded = Dense(10, activation = 'relu', kernel_initializer = 'glorot_uniform')(x) 

# Decoder: decompress (it doesn't have to be symmetrical to the encoder; this helps avoid multicollinearity)
x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(encoded) 
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x) 

#we finally decode everything to the initial 17 variables
decoded = Dense(17, kernel_initializer = 'glorot_uniform')(x)

#we define the autoencoder with the layer model
autoencoder = Model(input_df, decoded) 
autoencoder.compile(optimizer = 'adam', loss = 'mean_squared_error')

# we keep the encoder (compression) part separately; we'll use it later to produce the compressed representation
encoder = Model(input_df, encoded) 
df_scaled.shape
(8950, 17)
autoencoder.summary()
Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 17)]              0         
_________________________________________________________________
dense (Dense)                (None, 7)                 126       
_________________________________________________________________
dense_1 (Dense)              (None, 500)               4000      
_________________________________________________________________
dense_2 (Dense)              (None, 500)               250500    
_________________________________________________________________
dense_3 (Dense)              (None, 2000)              1002000   
_________________________________________________________________
dense_4 (Dense)              (None, 10)                20010     
_________________________________________________________________
dense_5 (Dense)              (None, 2000)              22000     
_________________________________________________________________
dense_6 (Dense)              (None, 500)               1000500   
_________________________________________________________________
dense_7 (Dense)              (None, 17)                8517      
=================================================================
Total params: 2,307,653
Trainable params: 2,307,653
Non-trainable params: 0
_________________________________________________________________
autoencoder.fit(df_scaled, df_scaled, batch_size=128, epochs = 25, verbose = 1)
Epoch 1/25
70/70 [==============================] - 3s 42ms/step - loss: 0.6645
Epoch 2/25
70/70 [==============================] - 3s 37ms/step - loss: 0.3590
Epoch 3/25
70/70 [==============================] - 3s 39ms/step - loss: 0.2527
Epoch 4/25
70/70 [==============================] - 2s 35ms/step - loss: 0.1980
Epoch 5/25
70/70 [==============================] - 2s 35ms/step - loss: 0.1706
Epoch 6/25
70/70 [==============================] - 3s 44ms/step - loss: 0.1584
Epoch 7/25
70/70 [==============================] - 3s 37ms/step - loss: 0.1404
Epoch 8/25
70/70 [==============================] - 3s 40ms/step - loss: 0.1285
Epoch 9/25
70/70 [==============================] - 3s 37ms/step - loss: 0.1284
Epoch 10/25
70/70 [==============================] - 3s 45ms/step - loss: 0.1135
Epoch 11/25
70/70 [==============================] - 3s 45ms/step - loss: 0.1050
Epoch 12/25
70/70 [==============================] - 3s 38ms/step - loss: 0.1013
Epoch 13/25
70/70 [==============================] - 3s 43ms/step - loss: 0.0951
Epoch 14/25
70/70 [==============================] - 3s 40ms/step - loss: 0.0915
Epoch 15/25
70/70 [==============================] - 3s 36ms/step - loss: 0.0897
Epoch 16/25
70/70 [==============================] - 3s 40ms/step - loss: 0.0866
Epoch 17/25
70/70 [==============================] - 3s 39ms/step - loss: 0.0858
Epoch 18/25
70/70 [==============================] - 3s 38ms/step - loss: 0.0830
Epoch 19/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0788
Epoch 20/25
70/70 [==============================] - 2s 35ms/step - loss: 0.0809
Epoch 21/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0763
Epoch 22/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0716
Epoch 23/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0705
Epoch 24/25
70/70 [==============================] - 3s 37ms/step - loss: 0.0675
Epoch 25/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0807

<tensorflow.python.keras.callbacks.History at 0x7fd8cc034250>

We end up with a reconstruction loss of roughly 0.08.
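The final loss can also be read from an explicit evaluation pass; a sketch (the exact value varies between runs because of random initialization):

# Mean squared reconstruction error on the scaled training data
reconstruction_loss = autoencoder.evaluate(df_scaled, df_scaled, verbose=0)
print(reconstruction_loss)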

– We save the weights to make the predictions:

autoencoder.save_weights('autoencoder.h5')  # .h5 is a standard file extension for saved neural network weights
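To reuse the trained network later without retraining, the weights can be restored into a model with the same architecture; a sketch:

# Restore the previously saved weights
autoencoder.load_weights('autoencoder.h5')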

– We apply the encoder created above (the first half of the network, up to the bottleneck) to obtain the compressed representation:

pred = encoder.predict(df_scaled)
pred.shape
(8950, 10)

We now have 10 features per customer; they come from the bottleneck in the middle of the neural network (as shown in the summary above).

Let’s apply K-Means:

scores_2 = []

range_values = range(1,20)

for i in range_values:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(pred)
    scores_2.append(kmeans.inertia_)


plt.plot(range_values, scores_2, 'bx-')
plt.title("Encontrar el número óptimo de clusters")
plt.xlabel("Número de Clusters")
plt.ylabel("WCSS(k)")
plt.show()

The elbow stabilizes around $k=4$.

Let’s compare the two elbow curves obtained with the two approaches (K-Means on the scaled data vs. on the encoded representation):

plt.plot(range_values, scores_1, 'bx-', color = "r")
plt.plot(range_values, scores_2, 'bx-', color = "g")

We keep $k=4$:

kmeans = KMeans(4)
kmeans.fit(pred)
labels = kmeans.labels_
y_kmeans = kmeans.fit_predict(pred)
df_cluster_dr = pd.concat([df, pd.DataFrame({'cluster': labels})], axis = 1)
df_cluster_dr.head()
[Output: df_cluster_dr.head() — the first five rows of the original features with the new 'cluster' column appended]
pca = PCA(n_components=2)
princ_comp = pca.fit_transform(pred)
pca_df = pd.DataFrame(data = princ_comp, columns=["pca1", "pca2"])
pca_df.head()
       pca1      pca2
0 -1.705835 -0.162720
1  3.051461 -1.535872
2  0.292494  1.344226
3  0.675992  0.319985
4 -1.705685 -0.063674
pca_df = pd.concat([pca_df, pd.DataFrame({"cluster":labels})], axis = 1)
pca_df
          pca1      pca2  cluster
0    -1.705835 -0.162720        1
1     3.051461 -1.535872        0
2     0.292494  1.344226        0
3     0.675992  0.319985        0
4    -1.705685 -0.063674        1
...        ...       ...      ...
8945 -1.279127 -0.162793        1
8946 -1.260987 -0.569104        1
8947 -1.207046 -0.294098        1
8948 -0.051287 -0.897754        1
8949  0.251878 -0.291193        0
plt.figure(figsize=(10,10))
ax = sns.scatterplot(x="pca1", y = "pca2", hue="cluster", data = pca_df, palette=["red", "green", "blue", "yellow"])
plt.show()

We can see that the first principal component (pca1) alone would almost be enough to decide which cluster each observation belongs to.
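That observation can be checked numerically by looking at how pca1 is distributed within each cluster; a closing sketch using pca_df:

# If the pca1 ranges barely overlap across clusters, pca1 alone separates them well
print(pca_df.groupby('cluster')['pca1'].describe())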