[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]
We have a large dataset from a bank. The sample dataset summarizes the usage behavior of about 9,000 active credit card holders during the last 6 months.
The marketing team wants to launch an advertising campaign by segmenting its customers into 3 different groups.
Data Dictionary for Credit Card dataset:
- CUST_ID: Identification of the credit card holder (categorical)
- BALANCE: Balance amount left in the account to make purchases
- BALANCE_FREQUENCY: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES: Amount of purchases made from the account
- ONEOFF_PURCHASES: Maximum purchase amount done in one go
- INSTALLMENTS_PURCHASES: Amount of purchases done in installments
- CASH_ADVANCE: Cash in advance given by the user
- PURCHASES_FREQUENCY: How frequently purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_PURCHASES_FREQUENCY: How frequently purchases are happening in one go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASES_INSTALLMENTS_FREQUENCY: How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASH_ADVANCE_FREQUENCY: How frequently cash in advance is being paid
- CASH_ADVANCE_TRX: Number of transactions made with "cash in advance"
- PURCHASES_TRX: Number of purchase transactions made
- CREDIT_LIMIT: Credit card limit for the user
- PAYMENTS: Amount of payments made by the user
- MINIMUM_PAYMENTS: Minimum amount of payments made by the user
- PRC_FULL_PAYMENT: Percentage of full payment paid by the user
- TENURE: Tenure of the credit card service for the user
Data Source: Credit Card Dataset
1 – Import libraries and dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
df = pd.read_csv("Marketing_data.csv")
df
CUST_ID | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | C10001 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.40 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 |
1 | C10002 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.00 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 |
2 | C10003 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.00 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 |
3 | C10004 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.00 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | NaN | 0.000000 | 12 |
4 | C10005 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.00 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
8945 | C19186 | 28.493517 | 1.000000 | 291.12 | 0.00 | 291.12 | 0.000000 | 1.000000 | 0.000000 | 0.833333 | 0.000000 | 0 | 6 | 1000.0 | 325.594462 | 48.886365 | 0.500000 | 6 |
8946 | C19187 | 19.183215 | 1.000000 | 300.00 | 0.00 | 300.00 | 0.000000 | 1.000000 | 0.000000 | 0.833333 | 0.000000 | 0 | 6 | 1000.0 | 275.861322 | NaN | 0.000000 | 6 |
8947 | C19188 | 23.398673 | 0.833333 | 144.40 | 0.00 | 144.40 | 0.000000 | 0.833333 | 0.000000 | 0.666667 | 0.000000 | 0 | 5 | 1000.0 | 81.270775 | 82.418369 | 0.250000 | 6 |
8948 | C19189 | 13.457564 | 0.833333 | 0.00 | 0.00 | 0.00 | 36.558778 | 0.000000 | 0.000000 | 0.000000 | 0.166667 | 2 | 0 | 500.0 | 52.549959 | 55.755628 | 0.250000 | 6 |
8949 | C19190 | 372.708075 | 0.666667 | 1093.25 | 1093.25 | 0.00 | 127.040008 | 0.666667 | 0.666667 | 0.000000 | 0.333333 | 2 | 23 | 1200.0 | 63.165404 | 88.288956 | 0.000000 | 6 |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 18 columns):
 #   Column                            Non-Null Count  Dtype
---  ------                            --------------  -----
 0   CUST_ID                           8950 non-null   object
 1   BALANCE                           8950 non-null   float64
 2   BALANCE_FREQUENCY                 8950 non-null   float64
 3   PURCHASES                         8950 non-null   float64
 4   ONEOFF_PURCHASES                  8950 non-null   float64
 5   INSTALLMENTS_PURCHASES            8950 non-null   float64
 6   CASH_ADVANCE                      8950 non-null   float64
 7   PURCHASES_FREQUENCY               8950 non-null   float64
 8   ONEOFF_PURCHASES_FREQUENCY        8950 non-null   float64
 9   PURCHASES_INSTALLMENTS_FREQUENCY  8950 non-null   float64
 10  CASH_ADVANCE_FREQUENCY            8950 non-null   float64
 11  CASH_ADVANCE_TRX                  8950 non-null   int64
 12  PURCHASES_TRX                     8950 non-null   int64
 13  CREDIT_LIMIT                      8949 non-null   float64
 14  PAYMENTS                          8950 non-null   float64
 15  MINIMUM_PAYMENTS                  8637 non-null   float64
 16  PRC_FULL_PAYMENT                  8950 non-null   float64
 17  TENURE                            8950 non-null   int64
dtypes: float64(14), int64(3), object(1)
memory usage: 1.2+ MB
df.describe()
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8950.000000 | 8949.000000 | 8950.000000 | 8637.000000 | 8950.000000 | 8950.000000 |
mean | 1564.474828 | 0.877271 | 1003.204834 | 592.437371 | 411.067645 | 978.871112 | 0.490351 | 0.202458 | 0.364437 | 0.135144 | 3.248827 | 14.709832 | 4494.449450 | 1733.143852 | 864.206542 | 0.153715 | 11.517318 |
std | 2081.531879 | 0.236904 | 2136.634782 | 1659.887917 | 904.338115 | 2097.163877 | 0.401371 | 0.298336 | 0.397448 | 0.200121 | 6.824647 | 24.857649 | 3638.815725 | 2895.063757 | 2372.446607 | 0.292499 | 1.338331 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 50.000000 | 0.000000 | 0.019163 | 0.000000 | 6.000000 |
25% | 128.281915 | 0.888889 | 39.635000 | 0.000000 | 0.000000 | 0.000000 | 0.083333 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1600.000000 | 383.276166 | 169.123707 | 0.000000 | 12.000000 |
50% | 873.385231 | 1.000000 | 361.280000 | 38.000000 | 89.000000 | 0.000000 | 0.500000 | 0.083333 | 0.166667 | 0.000000 | 0.000000 | 7.000000 | 3000.000000 | 856.901546 | 312.343947 | 0.000000 | 12.000000 |
75% | 2054.140036 | 1.000000 | 1110.130000 | 577.405000 | 468.637500 | 1113.821139 | 0.916667 | 0.300000 | 0.750000 | 0.222222 | 4.000000 | 17.000000 | 6500.000000 | 1901.134317 | 825.485459 | 0.142857 | 12.000000 |
max | 19043.138560 | 1.000000 | 49039.570000 | 40761.250000 | 22500.000000 | 47137.211760 | 1.000000 | 1.000000 | 1.000000 | 1.500000 | 123.000000 | 358.000000 | 30000.000000 | 50721.483360 | 76406.207520 | 1.000000 | 12.000000 |
- The average balance is $1564.
- The balance frequency is updated quite often, on average ~0.9 (frequently updated).
- The average purchase amount is ~$1,000.
- The average maximum one-off purchase amount is ~$600.
- The average purchase frequency is ~0.5.
- On the whole, the averages of ONEOFF_PURCHASES_FREQUENCY, PURCHASES_INSTALLMENTS_FREQUENCY and CASH_ADVANCE_FREQUENCY are low.
- The average credit card limit is ~$4,500.
- The percentage of full payment paid by users is ~15%.
- The average tenure of the credit card service is ~11.5 years.
Let's figure out who made the maximum one-off purchase of $40,761:
df[df["ONEOFF_PURCHASES"] == 40761.25]
CUST_ID | BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
550 | C10574 | 11547.52001 | 1.0 | 49039.57 | 40761.25 | 8278.32 | 558.166886 | 1.0 | 1.0 | 0.916667 | 0.083333 | 1 | 101 | 22500.0 | 46930.59824 | 2974.069421 | 0.25 | 12 |
This client made 101 purchase transactions, has a credit limit of $22,500 and paid back almost $47,000.
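More generally, if we want a quick look at the biggest one-off spenders (a small optional check, not part of the original analysis), pandas' nlargest does the job:

```python
# Top 5 customers by one-off purchase amount, with a few related columns
df.nlargest(5, "ONEOFF_PURCHASES")[
    ["CUST_ID", "ONEOFF_PURCHASES", "PURCHASES", "CREDIT_LIMIT", "PAYMENTS"]
]
```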
2 – Dataset Visualization
2.1 – First we’re going to check if we have missing values:
sns.heatmap(df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
df.isnull().sum()
CUST_ID                               0
BALANCE                               0
BALANCE_FREQUENCY                     0
PURCHASES                             0
ONEOFF_PURCHASES                      0
INSTALLMENTS_PURCHASES                0
CASH_ADVANCE                          0
PURCHASES_FREQUENCY                   0
ONEOFF_PURCHASES_FREQUENCY            0
PURCHASES_INSTALLMENTS_FREQUENCY      0
CASH_ADVANCE_FREQUENCY                0
CASH_ADVANCE_TRX                      0
PURCHASES_TRX                         0
CREDIT_LIMIT                          1
PAYMENTS                              0
MINIMUM_PAYMENTS                    313
PRC_FULL_PAYMENT                      0
TENURE                                0
dtype: int64
– We fill the missing values in 'MINIMUM_PAYMENTS' and 'CREDIT_LIMIT' with each column's mean:
df.loc[df['MINIMUM_PAYMENTS'].isnull(), 'MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].mean()
df.loc[df['CREDIT_LIMIT'].isnull(), 'CREDIT_LIMIT'] = df['CREDIT_LIMIT'].mean()
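The same imputation can also be written with fillna; this is just an equivalent, slightly more idiomatic sketch that produces the same result:

```python
# Equivalent imputation of the two columns with their means
df['MINIMUM_PAYMENTS'] = df['MINIMUM_PAYMENTS'].fillna(df['MINIMUM_PAYMENTS'].mean())
df['CREDIT_LIMIT'] = df['CREDIT_LIMIT'].fillna(df['CREDIT_LIMIT'].mean())
```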
2.2 – Second we’re going to check if we have duplicate values in our data:
df.duplicated().sum()
0
– We re-plot the missing-values heatmap to confirm that no missing values remain:

sns.heatmap(df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
– We can drop CUST_ID because it is not useful for clustering:
df.drop("CUST_ID", axis = 1, inplace=True) df.head()
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 |
1 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 |
2 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 |
3 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | 864.206542 | 0.000000 | 12 |
4 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 |
n = len(df.columns)
n
17
df.columns
Index(['BALANCE', 'BALANCE_FREQUENCY', 'PURCHASES', 'ONEOFF_PURCHASES', 'INSTALLMENTS_PURCHASES', 'CASH_ADVANCE', 'PURCHASES_FREQUENCY', 'ONEOFF_PURCHASES_FREQUENCY', 'PURCHASES_INSTALLMENTS_FREQUENCY', 'CASH_ADVANCE_FREQUENCY', 'CASH_ADVANCE_TRX', 'PURCHASES_TRX', 'CREDIT_LIMIT', 'PAYMENTS', 'MINIMUM_PAYMENTS', 'PRC_FULL_PAYMENT', 'TENURE'], dtype='object')
### The Kernel Density Estimate:
plt.figure(figsize = (10, 50))
for i in range(n):
    plt.subplot(n, 1, i+1)
    sns.distplot(df[df.columns[i]], kde_kws = {"color": "b", "lw": 3, "label": "KDE", 'bw': 0.2},
                 hist_kws = {"color": "g"})
    plt.title(df.columns[i])
plt.tight_layout()
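Note that distplot is deprecated in recent seaborn versions; if it is not available in your environment, a roughly equivalent sketch using histplot is:

```python
# Roughly equivalent per-column distribution plots with the newer seaborn API
plt.figure(figsize = (10, 50))
for i, col in enumerate(df.columns):
    plt.subplot(n, 1, i + 1)
    sns.histplot(df[col], kde = True, color = "g")
    plt.title(col)
plt.tight_layout()
```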
Correlation between variables:
correlations = df.corr()
f, ax = plt.subplots(figsize = (20, 20))
sns.heatmap(correlations, annot = True)
- There is a correlation between PURCHASES and the ONEOFF_PURCHASES and INSTALLMENTS_PURCHASES variables.
- We can also see a trend between PURCHASES and both CREDIT_LIMIT and PAYMENTS.
- There is also a high correlation between PURCHASES_FREQUENCY and PURCHASES_INSTALLMENTS_FREQUENCY.
3 – Finding the optimal number of clusters using ‘The Elbow Method’
- 'The Elbow Method' consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. The method is designed to help find an appropriate number of clusters for our dataset.
- If the line graph looks like an "arm", then the "elbow" on the arm is the value of k. This k will be our number of clusters.
Source:
- https://en.wikipedia.org/wiki/Elbow_method_(clustering)
- https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
First, let's scale our dataset so that no single variable stands out just because of its range.
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled.shape
(8950, 17)
df_scaled
array([[-0.73198937, -0.24943448, -0.42489974, ..., -0.31096755, -0.52555097, 0.36067954], [ 0.78696085, 0.13432467, -0.46955188, ..., 0.08931021, 0.2342269 , 0.36067954], [ 0.44713513, 0.51808382, -0.10766823, ..., -0.10166318, -0.52555097, 0.36067954], ..., [-0.7403981 , -0.18547673, -0.40196519, ..., -0.33546549, 0.32919999, -4.12276757], [-0.74517423, -0.18547673, -0.46955188, ..., -0.34690648, 0.32919999, -4.12276757], [-0.57257511, -0.88903307, 0.04214581, ..., -0.33294642, -0.52555097, -4.12276757]])
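As a quick sanity check (a small optional snippet), every scaled column should now have a mean close to 0 and a standard deviation close to 1:

```python
# Confirm that StandardScaler centred and scaled each of the 17 columns
print(df_scaled.mean(axis = 0).round(2))  # values close to 0
print(df_scaled.std(axis = 0).round(2))   # values close to 1
```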
– We apply the elbow method:
scores_1 = []
range_values = range(1, 20)

for i in range_values:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(df_scaled)
    scores_1.append(kmeans.inertia_)  # WCSS

plt.plot(range_values, scores_1, 'bx-')
plt.title("Finding the optimal number of clusters")
plt.xlabel("Clusters")
plt.ylabel("WCSS(k)")
plt.show()
- The "arm" starts to flatten out (our "elbow") at around 4 clusters.
- However, the decrease in WCSS does not become roughly linear until about the 8th cluster, so we choose that number.
4 – K-Means Method
kmeans = KMeans(8)
kmeans.fit(df_scaled)
labels = kmeans.labels_
labels
array([1, 3, 6, ..., 7, 7, 7], dtype=int32)
Each value is the cluster number assigned to the corresponding customer.
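To see how the customers are spread across the 8 clusters, a small optional check is:

```python
# Number of customers assigned to each of the 8 clusters
pd.Series(labels).value_counts().sort_index()
```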
kmeans.cluster_centers_.shape  # 8 centroids, each described by the 17 features
(8, 17)
We have 8 centroids (barycenters), each described by the 17 features.
cluster_centers = pd.DataFrame(data = kmeans.cluster_centers_, columns = df.columns)
cluster_centers
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.361746 | 0.332427 | -0.034919 | -0.242534 | 0.362821 | -0.363296 | 0.993026 | -0.383768 | 1.205252 | -0.473549 | -0.360250 | 0.189465 | -0.261298 | -0.214984 | -0.030438 | 0.313451 | 0.256258 |
1 | 0.012363 | 0.404002 | -0.357001 | -0.241611 | -0.400131 | -0.094194 | -0.852812 | -0.394335 | -0.754538 | 0.103185 | -0.028265 | -0.481551 | -0.303354 | -0.249373 | -0.011771 | -0.455386 | 0.273280 |
2 | 1.845310 | 0.340595 | 12.297201 | 12.823670 | 5.516158 | 0.272530 | 1.043177 | 2.145028 | 0.896761 | -0.380373 | -0.109730 | 4.556136 | 3.185151 | 9.047799 | 1.030898 | 1.222264 | 0.298409 |
3 | 1.686129 | 0.393025 | -0.217205 | -0.155325 | -0.228287 | 2.009008 | -0.470733 | -0.207959 | -0.411112 | 1.913538 | 1.919868 | -0.265613 | 1.029379 | 0.818956 | 0.552301 | -0.390101 | 0.071370 |
4 | -0.701258 | -2.140285 | -0.310336 | -0.234722 | -0.302444 | -0.322272 | -0.554827 | -0.441460 | -0.440553 | -0.521236 | -0.376356 | -0.419376 | -0.176011 | -0.202115 | -0.256800 | 0.283165 | 0.198977 |
5 | 1.039400 | 0.464856 | 2.504641 | 1.808428 | 2.599812 | -0.161205 | 1.164502 | 1.562982 | 1.272947 | -0.286097 | -0.150710 | 3.134487 | 1.297181 | 1.439786 | 0.560536 | 0.253915 | 0.337460 |
6 | -0.132975 | 0.400605 | 0.541559 | 0.671442 | 0.046770 | -0.331239 | 0.980011 | 1.904813 | 0.171671 | -0.412879 | -0.329926 | 0.618406 | 0.434597 | 0.144731 | -0.158053 | 0.444399 | 0.268773 |
7 | -0.336228 | -0.347383 | -0.287908 | -0.214195 | -0.286875 | 0.067425 | -0.201713 | -0.285924 | -0.224146 | 0.307084 | 0.000231 | -0.387540 | -0.563820 | -0.392784 | -0.209266 | 0.014243 | -3.202809 |
We apply the inverse of the scaling transformation to make these values easier to interpret.
cluster_centers = scaler.inverse_transform(cluster_centers)
cluster_centers = pd.DataFrame(data = cluster_centers, columns = df.columns)
cluster_centers
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 811.530431 | 0.956020 | 928.599272 | 189.879812 | 739.162116 | 217.022499 | 0.888900 | 0.087972 | 0.843436 | 0.040382 | 0.790387 | 19.419227 | 3543.741428 | 1110.786328 | 793.273098 | 0.245394 | 11.860258 |
1 | 1590.206589 | 0.972975 | 240.466411 | 191.411976 | 49.234195 | 781.342922 | 0.148076 | 0.084820 | 0.064565 | 0.155793 | 3.055939 | 2.740283 | 3390.725269 | 1011.232951 | 836.775777 | 0.020522 | 11.883037 |
2 | 5405.330935 | 0.957955 | 27276.363750 | 21877.102917 | 5399.260833 | 1550.378389 | 0.909028 | 0.842361 | 0.720833 | 0.059028 | 2.500000 | 127.958333 | 16083.333333 | 27925.634496 | 3266.671038 | 0.511206 | 11.916667 |
3 | 5074.010044 | 0.970375 | 539.143482 | 334.629072 | 204.630871 | 5191.855751 | 0.301422 | 0.140419 | 0.201051 | 0.518063 | 16.350515 | 8.107675 | 8239.753202 | 4103.940565 | 2151.320463 | 0.039617 | 11.612829 |
4 | 104.865352 | 0.370257 | 340.166450 | 202.846306 | 137.571031 | 303.051343 | 0.267672 | 0.070762 | 0.189350 | 0.030840 | 0.680473 | 4.285714 | 3854.048558 | 1148.040394 | 265.744196 | 0.236535 | 11.783601 |
5 | 3727.898336 | 0.987391 | 6354.408362 | 3594.057458 | 2762.045819 | 640.817196 | 0.957721 | 0.668725 | 0.870339 | 0.077893 | 2.220339 | 92.621469 | 9214.124294 | 5901.184611 | 2170.512276 | 0.227980 | 11.968927 |
6 | 1287.698840 | 0.972170 | 2160.255009 | 1706.894046 | 453.360963 | 284.247254 | 0.883676 | 0.770700 | 0.432664 | 0.052523 | 0.997326 | 30.081105 | 6075.692756 | 2152.127140 | 495.870743 | 0.283694 | 11.877005 |
7 | 864.645308 | 0.794979 | 388.085586 | 236.917416 | 151.649711 | 1120.263874 | 0.409393 | 0.117161 | 0.275356 | 0.196595 | 3.250401 | 5.077047 | 2443.040850 | 596.072587 | 376.521919 | 0.157880 | 7.231140 |
- First cluster of clients (Transactors): clients who pay the least amount of interest charges and are very careful with their money: the lowest balance (~$104) and cash advance (~$303), and a full-payment percentage of ~23%.
- Second cluster of clients (Revolvers): they use the credit card as a loan (the most lucrative segment): the highest balance (~$5,000) and cash advance (~$5,000), low purchase frequency, high cash-advance frequency (~0.5), many cash-advance transactions (~16) and a low full-payment percentage (~3%).
- Third cluster of clients (VIP/Prime): high credit limit (~$16K) and the highest percentage of full payment; the goal here is to increase their credit limit and their spending habits.
- Fourth cluster of clients (low tenure): clients with low seniority (~7 years) and low balance.
labels.shape
(8950,)
labels.min()
0
labels.max()
7
– We can also make predictions:
y_kmeans = kmeans.fit_predict(df_scaled)
y_kmeans
array([0, 2, 5, ..., 6, 6, 6], dtype=int32)
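Note that fit_predict re-fits the model, so the cluster numbering can be permuted with respect to the labels obtained earlier. To assign a cluster to genuinely new customers without re-fitting, a minimal sketch (here the first five rows of df simply stand in for hypothetical new data) would be:

```python
# Assign clusters to "new" observations with the already-fitted scaler and model
new_customers = df.head(5)                    # stand-in for unseen customers with the same 17 columns
new_scaled = scaler.transform(new_customers)  # reuse the fitted StandardScaler
kmeans.predict(new_scaled)                    # cluster index for each observation
```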
Let's concatenate the cluster labels with the original dataset, so that we can see which cluster each observation belongs to:
df_cluster = pd.concat([df, pd.DataFrame({'cluster': labels})], axis = 1)
df_cluster.head()
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 | 1 |
1 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 | 3 |
2 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 | 6 |
3 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | 864.206542 | 0.000000 | 12 | 1 |
4 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 | 1 |
– We visualize histograms for each cluster:
for i in df.columns:
    plt.figure(figsize = (35, 5))
    for j in range(8):
        plt.subplot(1, 8, j+1)
        cluster = df_cluster[df_cluster['cluster'] == j]
        cluster[i].hist(bins = 20)
        plt.title('{} \nCluster {}'.format(i, j))
    plt.show()
5 – Principal Component Analysis (PCA)
pca = PCA(n_components = 2)
principal_comp = pca.fit_transform(df_scaled)
principal_comp
array([[-1.68221922, -1.07645126], [-1.13829358, 2.50647075], [ 0.96968292, -0.38350887], ..., [-0.92620252, -1.81078537], [-2.33655126, -0.65796865], [-0.55642498, -0.40046541]])
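To get an idea of how much information the two components retain, we can look at the explained variance ratio (a small optional check; the exact figures depend on the run):

```python
# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance explained by pca1 + pca2
```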
We’re going to create a dataframe with the two components:
pca_df = pd.DataFrame(data = principal_comp, columns = ["pca1", "pca2"])
pca_df.head()
pca1 | pca2 | |
---|---|---|
0 | -1.682219 | -1.076451 |
1 | -1.138294 | 2.506471 |
2 | 0.969683 | -0.383509 |
3 | -0.873627 | 0.043164 |
4 | -1.599433 | -0.688581 |
Let's concatenate the cluster labels with the principal components' dataframe:
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster': labels})], axis = 1)
pca_df.head()
pca1 | pca2 | cluster | |
---|---|---|---|
0 | -1.682219 | -1.076451 | 1 |
1 | -1.138294 | 2.506471 | 3 |
2 | 0.969683 | -0.383509 | 6 |
3 | -0.873627 | 0.043164 | 1 |
4 | -1.599433 | -0.688581 | 1 |
plt.figure(figsize = (10, 10))
ax = sns.scatterplot(x = "pca1", y = "pca2", hue = "cluster", data = pca_df,
                     palette = ["red", "green", "blue", "pink", "yellow", "gray", "purple", "black"])
plt.show()
- PCA has reduced our 17 dimensions to only 2 (keep in mind that K-Means was applied before PCA, which is why some overlapping groups appear).
- The pink cluster is much more scattered than the rest, while the yellow one carries most of its information along pca2 and the blue one is grouped around pca1.
- The rest of the clusters sit close to 0, which means they have similar values.
6 – Autoencoders: a neural network to reduce the dimensionality of our dataset
from tensorflow.keras.layers import Input, Dense, AveragePooling2D, MaxPooling2D, Dropout
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.initializers import glorot_uniform
from keras.optimizers import SGD  # stochastic gradient descent

# Compress all the information into 7 dimensions.
encoding_dim = 7

input_df = Input(shape = (17, ))
Let's create the layers:
# ReLU activation function (Rectified Linear Unit)
x = Dense(encoding_dim, activation = 'relu')(input_df)

# 3 hidden layers.
# The Glorot normal initializer (Xavier normal initializer) draws samples from a truncated normal distribution.
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)
x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)

# We finally encode everything into only 10 neurons (the information in the encoder layer).
encoded = Dense(10, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)

# Decompression (it doesn't have to be symmetrical; this helps avoid multicollinearity).
x = Dense(2000, activation = 'relu', kernel_initializer = 'glorot_uniform')(encoded)
x = Dense(500, activation = 'relu', kernel_initializer = 'glorot_uniform')(x)

# We finally decode everything back to the initial 17 variables.
decoded = Dense(17, kernel_initializer = 'glorot_uniform')(x)

# We define the autoencoder with the functional Model API.
autoencoder = Model(input_df, decoded)
autoencoder.compile(optimizer = 'adam', loss = 'mean_squared_error')

# We keep the compression part; we'll use it later as the encoder.
encoder = Model(input_df, encoded)
df_scaled.shape
(8950, 17)
autoencoder.summary()
Model: "functional_1" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_1 (InputLayer) [(None, 17)] 0 _________________________________________________________________ dense (Dense) (None, 7) 126 _________________________________________________________________ dense_1 (Dense) (None, 500) 4000 _________________________________________________________________ dense_2 (Dense) (None, 500) 250500 _________________________________________________________________ dense_3 (Dense) (None, 2000) 1002000 _________________________________________________________________ dense_4 (Dense) (None, 10) 20010 _________________________________________________________________ dense_5 (Dense) (None, 2000) 22000 _________________________________________________________________ dense_6 (Dense) (None, 500) 1000500 _________________________________________________________________ dense_7 (Dense) (None, 17) 8517 ================================================================= Total params: 2,307,653 Trainable params: 2,307,653 Non-trainable params: 0 _________________________________________________________________
autoencoder.fit(df_scaled, df_scaled, batch_size=128, epochs = 25, verbose = 1)
Epoch 1/25
70/70 [==============================] - 3s 42ms/step - loss: 0.6645
Epoch 2/25
70/70 [==============================] - 3s 37ms/step - loss: 0.3590
Epoch 3/25
70/70 [==============================] - 3s 39ms/step - loss: 0.2527
Epoch 4/25
70/70 [==============================] - 2s 35ms/step - loss: 0.1980
Epoch 5/25
70/70 [==============================] - 2s 35ms/step - loss: 0.1706
Epoch 6/25
70/70 [==============================] - 3s 44ms/step - loss: 0.1584
Epoch 7/25
70/70 [==============================] - 3s 37ms/step - loss: 0.1404
Epoch 8/25
70/70 [==============================] - 3s 40ms/step - loss: 0.1285
Epoch 9/25
70/70 [==============================] - 3s 37ms/step - loss: 0.1284
Epoch 10/25
70/70 [==============================] - 3s 45ms/step - loss: 0.1135
Epoch 11/25
70/70 [==============================] - 3s 45ms/step - loss: 0.1050
Epoch 12/25
70/70 [==============================] - 3s 38ms/step - loss: 0.1013
Epoch 13/25
70/70 [==============================] - 3s 43ms/step - loss: 0.0951
Epoch 14/25
70/70 [==============================] - 3s 40ms/step - loss: 0.0915
Epoch 15/25
70/70 [==============================] - 3s 36ms/step - loss: 0.0897
Epoch 16/25
70/70 [==============================] - 3s 40ms/step - loss: 0.0866
Epoch 17/25
70/70 [==============================] - 3s 39ms/step - loss: 0.0858
Epoch 18/25
70/70 [==============================] - 3s 38ms/step - loss: 0.0830
Epoch 19/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0788
Epoch 20/25
70/70 [==============================] - 2s 35ms/step - loss: 0.0809
Epoch 21/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0763
Epoch 22/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0716
Epoch 23/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0705
Epoch 24/25
70/70 [==============================] - 3s 37ms/step - loss: 0.0675
Epoch 25/25
70/70 [==============================] - 2s 33ms/step - loss: 0.0807
<tensorflow.python.keras.callbacks.History at 0x7fd8cc034250>
The training loss drops to roughly 0.07–0.08 by the final epochs.
– We save the weights to make the predictions:
autoencoder.save_weights('autoencoder.h5')  # .h5 is the standard extension for saved network weights
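If we later want to reuse the trained network in a new session without retraining, the saved weights can be loaded back into a model with the same architecture (a minimal sketch, assuming the 'autoencoder.h5' file saved above):

```python
# Restore the trained weights into the (already-defined) autoencoder architecture
autoencoder.load_weights('autoencoder.h5')
```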
– We apply the encoder created above (the first half of the network is the encoder) to compress the data:
pred = encoder.predict(df_scaled)
pred.shape
(8950, 10)
We now have 10 features per customer; they come from the encoder half of the network (as shown in the summary above).
Let’s apply K-Means:
scores_2 = []
range_values = range(1, 20)

for i in range_values:
    kmeans = KMeans(n_clusters = i)
    kmeans.fit(pred)
    scores_2.append(kmeans.inertia_)

plt.plot(range_values, scores_2, 'bx-')
plt.title("Finding the optimal number of clusters")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS(k)")
plt.show()
The elbow stabilizes around $k=4$.
We are going to compare the two elbow curves obtained via the two different paths:
plt.plot(range_values, scores_1, 'bx-', color = "r")
plt.plot(range_values, scores_2, 'bx-', color = "g")
We keep $k=4$:
kmeans = KMeans(4)
kmeans.fit(pred)
labels = kmeans.labels_
y_kmeans = kmeans.fit_predict(pred)
df_cluster_dr = pd.concat([df, pd.DataFrame({'cluster': labels})], axis = 1)
df_cluster_dr.head()
BALANCE | BALANCE_FREQUENCY | PURCHASES | ONEOFF_PURCHASES | INSTALLMENTS_PURCHASES | CASH_ADVANCE | PURCHASES_FREQUENCY | ONEOFF_PURCHASES_FREQUENCY | PURCHASES_INSTALLMENTS_FREQUENCY | CASH_ADVANCE_FREQUENCY | CASH_ADVANCE_TRX | PURCHASES_TRX | CREDIT_LIMIT | PAYMENTS | MINIMUM_PAYMENTS | PRC_FULL_PAYMENT | TENURE | cluster | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40.900749 | 0.818182 | 95.40 | 0.00 | 95.4 | 0.000000 | 0.166667 | 0.000000 | 0.083333 | 0.000000 | 0 | 2 | 1000.0 | 201.802084 | 139.509787 | 0.000000 | 12 | 1 |
1 | 3202.467416 | 0.909091 | 0.00 | 0.00 | 0.0 | 6442.945483 | 0.000000 | 0.000000 | 0.000000 | 0.250000 | 4 | 0 | 7000.0 | 4103.032597 | 1072.340217 | 0.222222 | 12 | 0 |
2 | 2495.148862 | 1.000000 | 773.17 | 773.17 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0 | 12 | 7500.0 | 622.066742 | 627.284787 | 0.000000 | 12 | 0 |
3 | 1666.670542 | 0.636364 | 1499.00 | 1499.00 | 0.0 | 205.788017 | 0.083333 | 0.083333 | 0.000000 | 0.083333 | 1 | 1 | 7500.0 | 0.000000 | 864.206542 | 0.000000 | 12 | 0 |
4 | 817.714335 | 1.000000 | 16.00 | 16.00 | 0.0 | 0.000000 | 0.083333 | 0.083333 | 0.000000 | 0.000000 | 0 | 1 | 1200.0 | 678.334763 | 244.791237 | 0.000000 | 12 | 1 |
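To give the marketing team an interpretable profile of each of the 4 segments, we can summarize the average behaviour per cluster (a small optional sketch using the dataframe built above):

```python
# Average value of every feature within each of the 4 clusters
cluster_profile = df_cluster_dr.groupby('cluster').mean()
cluster_profile
```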
pca = PCA(n_components = 2)
princ_comp = pca.fit_transform(pred)
pca_df = pd.DataFrame(data = princ_comp, columns = ["pca1", "pca2"])
pca_df.head()
pca1 | pca2 | |
---|---|---|
0 | -1.705835 | -0.162720 |
1 | 3.051461 | -1.535872 |
2 | 0.292494 | 1.344226 |
3 | 0.675992 | 0.319985 |
4 | -1.705685 | -0.063674 |
pca_df = pd.concat([pca_df, pd.DataFrame({"cluster":labels})], axis = 1) pca_df
pca1 | pca2 | cluster | |
---|---|---|---|
0 | -1.705835 | -0.162720 | 1 |
1 | 3.051461 | -1.535872 | 0 |
2 | 0.292494 | 1.344226 | 0 |
3 | 0.675992 | 0.319985 | 0 |
4 | -1.705685 | -0.063674 | 1 |
... | ... | ... | ... |
8945 | -1.279127 | -0.162793 | 1 |
8946 | -1.260987 | -0.569104 | 1 |
8947 | -1.207046 | -0.294098 | 1 |
8948 | -0.051287 | -0.897754 | 1 |
8949 | 0.251878 | -0.291193 | 0 |
plt.figure(figsize=(10,10)) ax = sns.scatterplot(x="pca1", y = "pca2", hue="cluster", data = pca_df, palette=["red", "green", "blue", "yellow"]) plt.show()
We can see that the first principal component (pca1) alone would almost be enough to decide which cluster each observation belongs to.