Sales forecasting for the sales department

[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]

Predictive models can be very interesing in departments like Sales. Predictive models forecast future sales based on historical data.

The goal is to predict future daily sales based on the characteristics provided in the dataset. This dataset has information on 1115 company stores.

The Company has supplied us 3 .csv files:

  • train.csv – Historical data including Sales.
  • test.csv – Historical data excluding Sales.
  • store.csv – Supplemental information about the stores.

This features are some of them:

  • Id: An Id that represents a (Store, Date) duple within the test set.

  • Store: A unique Id for each store.

  • Sales: The turnover for any given day (this is what you are predicting).

  • Customers: The number of customers on a given day.

  • Open: An indicator for whether the store was open: 0 = closed, 1 = open.

  • StateHoliday: Indicates a state holiday. Normally all stores, with few exceptions, are closed on state holidays. Note that all schools are closed on public holidays and weekends. a = public holiday, b = Easter holiday, c = Christmas, 0 = None.

  • SchoolHoliday: Indicates if the (Store, Date) was affected by the closure of public schools.

  • StoreType: Differentiates between 4 different store models: a, b, c, d.

  • Assortment: Describes an assortment level: a = basic, b = extra, c = extended.

  • CompetitionDistance: Distance in meters to the nearest competitor store.

  • CompetitionOpenSince[Month/Year]: Gives the approximate year and month of the time the nearest competitor was opened.

  • Promo: Indicates whether a store is running a promo on that day.

  • Promo2: Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating..

  • Promo2Since[Year/Week]: Describes the year and calendar week when the store started participating in Promo2.

  • PromoInterval: Describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store.

Source: https://www.kaggle.com/c/cs3244-rossmann-store-sales/data

1- Import libraries and dataset

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime

Let’s load train.csv:

df = pd.read_csv("train.csv")
df.shape
(1017209, 9)
  • We have almost 1M of observations.
  • 1115 stores
  • ‘Sales’ is our target variable
df.head().to_csv()
',Store,DayOfWeek,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday\n0,1,5,2015-07-31,5263,555,1,1,0,1\n1,2,5,2015-07-31,6064,625,1,1,0,1\n2,3,5,2015-07-31,8314,821,1,1,0,1\n3,4,5,2015-07-31,13995,1498,1,1,0,1\n4,5,5,2015-07-31,4822,559,1,1,0,1\n'
df.tail()
StoreDayOfWeekDateSalesCustomersOpenPromoStateHolidaySchoolHoliday
1017204111122013-01-010000a1
1017205111222013-01-010000a1
1017206111322013-01-010000a1
1017207111422013-01-010000a1
1017208111522013-01-010000a1
df.info()
<class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1017209 entries, 0 to 1017208
    Data columns (total 9 columns):
     #   Column         Non-Null Count    Dtype 
    ---  ------         --------------    ----- 
     0   Store          1017209 non-null  int64 
     1   DayOfWeek      1017209 non-null  int64 
     2   Date           1017209 non-null  object
     3   Sales          1017209 non-null  int64 
     4   Customers      1017209 non-null  int64 
     5   Open           1017209 non-null  int64 
     6   Promo          1017209 non-null  int64 
     7   StateHoliday   1017209 non-null  object
     8   SchoolHoliday  1017209 non-null  int64 
    dtypes: int64(7), object(2)
    memory usage: 69.8+ MB
  • 9 columns.
  • 8 features (each with 1017209 points).
  • 1 target variable (Sales).
df.describe()
StoreDayOfWeekSalesCustomersOpenPromoSchoolHoliday
count1.017209e+061.017209e+061.017209e+061.017209e+061.017209e+061.017209e+061.017209e+06
mean5.584297e+023.998341e+005.773819e+036.331459e+028.301067e-013.815145e-011.786467e-01
std3.219087e+021.997391e+003.849926e+034.644117e+023.755392e-014.857586e-013.830564e-01
min1.000000e+001.000000e+000.000000e+000.000000e+000.000000e+000.000000e+000.000000e+00
25%2.800000e+022.000000e+003.727000e+034.050000e+021.000000e+000.000000e+000.000000e+00
50%5.580000e+024.000000e+005.744000e+036.090000e+021.000000e+000.000000e+000.000000e+00
75%8.380000e+026.000000e+007.856000e+038.370000e+021.000000e+001.000000e+000.000000e+00
max1.115000e+037.000000e+004.155100e+047.388000e+031.000000e+001.000000e+001.000000e+00
  • Average sales a day = 5773 Euros.
  • Minimal sales a day = 0 and Max sales a day = 41551.
  • Average customers = 633, minimum number of customers = 0, max number of customers = 7388.

Let’s load store.csv:

store_df = pd.read_csv("store.csv")
store_df.head()
StoreStoreTypeAssortmentCompetitionDistanceCompetitionOpenSinceMonthCompetitionOpenSinceYearPromo2Promo2SinceWeekPromo2SinceYearPromoInterval
01ca1270.09.02008.00NaNNaNNaN
12aa570.011.02007.0113.02010.0Jan,Apr,Jul,Oct
23aa14130.012.02006.0114.02011.0Jan,Apr,Jul,Oct
34cc620.09.02009.00NaNNaNNaN
45aa29910.04.02015.00NaNNaNNaN
store_df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1115 entries, 0 to 1114
    Data columns (total 10 columns):
     #   Column                     Non-Null Count  Dtype  
    ---  ------                     --------------  -----  
     0   Store                      1115 non-null   int64  
     1   StoreType                  1115 non-null   object 
     2   Assortment                 1115 non-null   object 
     3   CompetitionDistance        1112 non-null   float64
     4   CompetitionOpenSinceMonth  761 non-null    float64
     5   CompetitionOpenSinceYear   761 non-null    float64
     6   Promo2                     1115 non-null   int64  
     7   Promo2SinceWeek            571 non-null    float64
     8   Promo2SinceYear            571 non-null    float64
     9   PromoInterval              571 non-null    object 
    dtypes: float64(5), int64(2), object(3)
    memory usage: 87.2+ KB
store_df.describe()
StoreCompetitionDistanceCompetitionOpenSinceMonthCompetitionOpenSinceYearPromo2Promo2SinceWeekPromo2SinceYear
count1115.000001112.000000761.000000761.0000001115.000000571.000000571.000000
mean558.000005404.9010797.2247042008.6688570.51210823.5954472011.763573
std322.017087663.1747203.2123486.1959830.50007814.1419841.674935
min1.0000020.0000001.0000001900.0000000.0000001.0000002009.000000
25%279.50000717.5000004.0000002006.0000000.00000013.0000002011.000000
50%558.000002325.0000008.0000002010.0000001.00000022.0000002012.000000
75%836.500006882.50000010.0000002013.0000001.00000037.0000002013.000000
max1115.0000075860.00000012.0000002015.0000001.00000050.0000002015.000000
  • CompetitionDistance’s average is 5,4 kms.

2 – Missing data and data visualization

sns.heatmap(df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
df.hist(bins = 30, figsize=(20,20), color = 'r')
  • Averge customers per day: 600. Max.: 4500
  • Data are distributed evently (~ 150000 observations x 7 days = ~ 1,1M observations)
  • Stores are open ~ 80% of the time
  • Data have a similar distribution in all the stores.
  • Average sales are about 5000-6000 €
  • School holidays last abaout 18% of the time
df["Customers"].max()
7388

– Open and Closed stores:

closed_train_df = df[df['Open'] == 0]
open_train_df   = df[df['Open'] == 1]
print("Total = {} ".format(len(df)))
print("Stores Open = {}".format(len(open_train_df)))
print("Stores Closed = {}".format(len(closed_train_df)))
print("Percentage of stores closed = {}%".format(100.0*len(closed_train_df)/len(df)))
    Total = 1017209 
    Stores Open = 844392
    Stores Closed = 172817
    Percentage of stores closed = 16.98933060954042%

– We only keep the sotres open and eliminate the other ones:

df = df[df['Open'] == 1]
df
StoreDayOfWeekDateSalesCustomersOpenPromoStateHolidaySchoolHoliday
0152015-07-3152635551101
1252015-07-3160646251101
2352015-07-3183148211101
3452015-07-311399514981101
4552015-07-3148225591101
..............................
101677668222013-01-01337556610a1
101682773322013-01-0110765237710a1
101686376922013-01-015035124810a1
101704294822013-01-014491103910a1
1017190109722013-01-015961140510a1

– We remove the open column:

df.drop(['Open'], axis = 1, inplace = True)
df
StoreDayOfWeekDateSalesCustomersPromoStateHolidaySchoolHoliday
0152015-07-315263555101
1252015-07-316064625101
2352015-07-318314821101
3452015-07-31139951498101
4552015-07-314822559101
...........................
101677668222013-01-0133755660a1
101682773322013-01-011076523770a1
101686376922013-01-01503512480a1
101704294822013-01-01449110390a1
1017190109722013-01-01596114050a1
df.describe()
StoreDayOfWeekSalesCustomersPromoSchoolHoliday
count844392.000000844392.000000844392.000000844392.000000844392.000000844392.000000
mean558.4229203.5203616955.514291762.7283950.4463520.193580
std321.7319141.7236893104.214680401.2276740.4971140.395103
min1.0000001.0000000.0000000.0000000.0000000.000000
25%280.0000002.0000004859.000000519.0000000.0000000.000000
50%558.0000003.0000006369.000000676.0000000.0000000.000000
75%837.0000005.0000008360.000000893.0000001.0000000.000000
max1115.0000007.00000041551.0000007388.0000001.0000001.000000
  • Average sales = 6955 €.
  • Average customers = 762 (the figures have risen).

Missing values

sns.heatmap(store_df.isnull(), yticklabels=False, cbar=False, cmap = "Blues")
store_df[store_df['CompetitionDistance'].isnull()]
StoreStoreTypeAssortmentCompetitionDistanceCompetitionOpenSinceMonthCompetitionOpenSinceYearPromo2Promo2SinceWeekPromo2SinceYearPromoInterval
290291daNaNNaNNaN0NaNNaNNaN
621622acNaNNaNNaN0NaNNaNNaN
878879daNaNNaNNaN15.02013.0Feb,May,Aug,Nov
store_df[store_df['CompetitionOpenSinceMonth'].isnull()]
StoreStoreTypeAssortmentCompetitionDistanceCompetitionOpenSinceMonthCompetitionOpenSinceYearPromo2Promo2SinceWeekPromo2SinceYearPromoInterval
1112ac1070.0NaNNaN113.02010.0Jan,Apr,Jul,Oct
1213da310.0NaNNaN145.02009.0Feb,May,Aug,Nov
1516ac3270.0NaNNaN0NaNNaNNaN
1819ac3240.0NaNNaN122.02011.0Mar,Jun,Sept,Dec
2122aa1040.0NaNNaN122.02012.0Jan,Apr,Jul,Oct
.................................
10951096ac1130.0NaNNaN110.02014.0Mar,Jun,Sept,Dec
10991100aa540.0NaNNaN114.02011.0Jan,Apr,Jul,Oct
11121113ac9260.0NaNNaN0NaNNaNNaN
11131114ac870.0NaNNaN0NaNNaNNaN
11141115dc5350.0NaNNaN122.02012.0Mar,Jun,Sept,Dec
  • Missing data from 3 observations in ‘CompetitionDistance’.
  • Missing data from 354 observations in ‘CompetitionOpenSinceMonth’ (almost a third of 1115 stores).
store_df[store_df['Promo2'] == 0]
StoreStoreTypeAssortmentCompetitionDistanceCompetitionOpenSinceMonthCompetitionOpenSinceYearPromo2Promo2SinceWeekPromo2SinceYearPromoInterval
01ca1270.09.02008.00NaNNaNNaN
34cc620.09.02009.00NaNNaNNaN
45aa29910.04.02015.00NaNNaNNaN
56aa310.012.02013.00NaNNaNNaN
67ac24000.04.02013.00NaNNaNNaN
.................................
11071108aa540.04.02004.00NaNNaNNaN
11091110cc900.09.02010.00NaNNaNNaN
11111112cc1880.04.02006.00NaNNaNNaN
11121113ac9260.0NaNNaN0NaNNaNNaN
11131114ac870.0NaNNaN0NaNNaNNaN
  • It seems if ‘promo2’ is zero, then ‘promo2SinceWeek’, ‘Promo2SinceYear’ and ‘PromoInterval’ are zero. This is logical because if we don’t have a promotion we woudn’t have a promotion date.

– We have missing data from 354 observations in ‘CompetitionOpenSinceMonth’ and ‘CompetitionOpenSinceYear’, we’re going to put them to zero:

str_cols = ['Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval', 'CompetitionOpenSinceYear', 'CompetitionOpenSinceMonth']

for str in str_cols:
    store_df[str].fillna(0, inplace = True)
sns.heatmap(store_df.isnull(), yticklabels=False, cbar=False, cmap = "Blues")

– We fill the missing values of ‘CompetitionDistance’ with average values:

store_df['CompetitionDistance'].fillna(store_df['CompetitionDistance'].mean(), inplace=True)
sns.heatmap(store_df.isnull(), yticklabels=False, cbar=False, cmap = "Blues")
store_df.hist(bins = 30, figsize=(20,20), color = 'r')
  • The number 2 promotion is being followed by half of the stores.
  • A Half of the stores have their rivals at a distance of 0-3000 m.

Let’s explore the combined dataset (sore.csv + test.csv)

– We’re going to mix both dataset using an inner:

inner_df = pd.merge(df, store_df, how = 'inner', on = 'Store') #'on' is a column merge
inner_df.to_csv('test.csv', index = False)
inner_df
StoreDayOfWeekDateSalesCustomersPromoStateHolidaySchoolHolidayStoreTypeAssortmentCompetitionDistanceCompetitionOpenSinceMonthCompetitionOpenSinceYearPromo2Promo2SinceWeekPromo2SinceYearPromoInterval
0152015-07-315263555101ca1270.09.02008.000.00.00
1142015-07-305020546101ca1270.09.02008.000.00.00
2132015-07-294782523101ca1270.09.02008.000.00.00
3122015-07-285011560101ca1270.09.02008.000.00.00
4112015-07-276102612101ca1270.09.02008.000.00.00
......................................................
84438729212013-01-0792911002100aa1100.06.02009.000.00.00
84438829262013-01-052748340000aa1100.06.02009.000.00.00
84438929252013-01-044202560001aa1100.06.02009.000.00.00
84439029242013-01-034580662001aa1100.06.02009.000.00.00
84439129232013-01-025076672001aa1100.06.02009.000.00.00

– Correlation:

correlations = inner_df.corr()['Sales'].sort_values()
correlations
    DayOfWeek                   -0.178736
    Promo2SinceYear             -0.127621
    Promo2                      -0.127596
    Promo2SinceWeek             -0.058476
    CompetitionDistance         -0.036343
    CompetitionOpenSinceMonth   -0.018370
    CompetitionOpenSinceYear     0.005266
    Store                        0.007710
    SchoolHoliday                0.038617
    Promo                        0.368145
    Customers                    0.823597
    Sales                        1.000000
    Name: Sales, dtype: float64

The positves ones are correlated and de negative ones not. Near to zero don’t tell us anything.

  • Customers and promotion are positively correlated with sales.
  • ‘Promo2’ doesn’t seems very effective to us.
  • Sales decrease as the week forward (‘Sales’ is in 1 and ‘DayOfWeek’* in -0.178.
correlations = inner_df.corr()
f, ax = plt.subplots(figsize = (20,20))
sns.heatmap(correlations, annot=True)
  • Clients, Promo2 and Sales have a strong correlation.

Let’s split year, month and day and then put them in a separate column:

inner_df['Year'] = pd.DatetimeIndex(inner_df['Date']).year
inner_df['Month'] = pd.DatetimeIndex(inner_df['Date']).month
inner_df['Day'] = pd.DatetimeIndex(inner_df['Date']).day
inner_df
StoreDayOfWeekDateSalesCustomersPromoStateHolidaySchoolHolidayStoreTypeAssortmentCompetitionDistanceCompetitionOpenSinceMonthCompetitionOpenSinceYearPromo2Promo2SinceWeekPromo2SinceYearPromoIntervalYearMonthDay
0152015-07-315263555101ca1270.09.02008.000.00.002015731
1142015-07-305020546101ca1270.09.02008.000.00.002015730
2132015-07-294782523101ca1270.09.02008.000.00.002015729
3122015-07-285011560101ca1270.09.02008.000.00.002015728
4112015-07-276102612101ca1270.09.02008.000.00.002015727
...............................................................
84438729212013-01-0792911002100aa1100.06.02009.000.00.00201317
84438829262013-01-052748340000aa1100.06.02009.000.00.00201315
84438929252013-01-044202560001aa1100.06.02009.000.00.00201314
84439029242013-01-034580662001aa1100.06.02009.000.00.00201313
84439129232013-01-025076672001aa1100.06.02009.000.00.00201312

Average sales vs. number of customers per month:

axis = inner_df.groupby('Month')[['Sales']].mean().plot(figsize = (10, 5), marker = 'o', color = 'r')
axis.set_title("Average sales per month")

plt.figure()
axis = inner_df.groupby('Month')[['Customers']].mean().plot(figsize = (10, 5), marker = '^', color = 'b')
axis.set_title("Average customers per month")
  • It seems that the sales and customers reached its lowest point in Christmas.

Average sales vs. customers per day of the month:

axis = inner_df.groupby('Day')[['Sales']].mean().plot(figsize = (10, 5), marker = 'o', color = 'r')
axis.set_title("Average sales per day of the month")

axis = inner_df.groupby('Day')[['Customers']].mean().plot(figsize = (10, 5), marker = '^', color = 'b')
axis.set_title("Average customers per day of the month")
  • The minimum number of clients is around the 24th of the month.
  • Most customers and sales are between the 30th and the 1st of the month.

Average sales vs. customers per day of the week (7 = Sunday)

axis = inner_df.groupby('DayOfWeek')[['Sales']].mean().plot(figsize = (10, 5), marker = 'o', color = 'r')
axis.set_title("Average sales per day of the week")

axis = inner_df.groupby('DayOfWeek')[['Customers']].mean().plot(figsize = (10, 5), marker = '^', color = 'b')
axis.set_title("Average customers per day of the week")
  • The day of greatest sales and clients is Sunday.
  • When we look at sales we can see there is not much difference between Sunday and Monday.

Average date vs. Store Type per Sales

fig, ax = plt.subplots(figsize = (20, 10))
inner_df.groupby(['Date', 'StoreType']).mean()['Sales'].unstack().plot(ax = ax)
plt.figure(figsize=[15,10])

plt.subplot(211)
sns.barplot(x = 'Promo', y = 'Sales', data = inner_df)

plt.subplot(212)
sns.barplot(x = 'Promo', y = 'Customers', data = inner_df)
  • When we have a promotion, then sales and clients go up.
plt.figure(figsize=[15,10])

plt.subplot(211)
sns.violinplot(x = 'Promo', y = 'Sales', data = inner_df)

plt.subplot(212)
sns.violinplot(x = 'Promo', y = 'Customers', data = inner_df)

3 – Forecasting with ‘Facebook Prophet’

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.

Source: https://facebook.github.io/prophet/

Training Model A

!pip install fbprophet
from fbprophet import Prophet
def sales_predictions(Store_ID, sales_df, periods): #intervals of time to predict
    '''
    This function modify the name of the variables like 'ds', beacuse Facebook algotithm forces us to put it this way.
    '''
    sales_df = sales_df[sales_df['Store'] == Store_ID]
    sales_df = sales_df[['Date', 'Sales']].rename(columns = {'Date': 'ds', 'Sales': 'y'})
    sales_df = sales_df.sort_values('ds')

    model    = Prophet()
    model.fit(sales_df)
    
    #prediction
    future   = model.make_future_dataframe(periods = periods)
    forecast = model.predict(future) 
    
    figure   = model.plot(forecast, xlabel = "Date", ylabel = "Sales")
    figure2  = model.plot_components(forecast)  
sales_predictions(10, inner_df, 60)

Training Model B

  • StateHoliday: It tells us weather it was a holiday day or not (a= normal holiday, b = Easter holiday, c = Christmas, 0 = Not a holiday)
  • SchoolHoliday: It tells us weather Sotre or Date is affected by the closure of public schools or not.

– We manually add the holidays to the algorithm:

def sales_predictions(Store_ID, sales_df, holidays, periods):
    sales_df = sales_df[sales_df['Store'] == Store_ID]
    sales_df = sales_df[['Date', 'Sales']].rename(columns = {'Date': 'ds', 'Sales': 'y'})
    sales_df = sales_df.sort_values('ds')

    model    = Prophet(holidays=holidays)
    model.fit(sales_df)
    future   = model.make_future_dataframe(periods = periods)
    forecast = model.predict(future)
    figure   = model.plot(forecast, xlabel = "Fecha", ylabel = "Ventas")
    figure2  = model.plot_components(forecast)

– We get all the dates related to school holidays:

school_holidays = inner_df[inner_df['SchoolHoliday'] == 1].loc[:, 'Date'].values
school_holidays.shape
(163457,)
school_holidays = np.unique(school_holidays)
school_holidays.shape
(477,)

– We get all the dates corresponding to State holidays:

state_holidays = inner_df[(inner_df['StateHoliday'] == 'a') | (inner_df['StateHoliday'] == 'b') | (inner_df['StateHoliday'] == 'c')].loc[:, 'Date'].values
state_holidays.shape
(910,)
state_holidays = np.unique(state_holidays)
state_holidays.shape
(35,)
  • 35 days’ holydays.
school_holidays = pd.DataFrame({'ds': pd.to_datetime(school_holidays), 
                                'holiday': 'school_holiday'})
school_holidays
dsholiday
02013-01-01school_holiday
12013-01-02school_holiday
22013-01-03school_holiday
32013-01-04school_holiday
42013-01-05school_holiday
.........
4722015-07-27school_holiday
4732015-07-28school_holiday
4742015-07-29school_holiday
4752015-07-30school_holiday
4762015-07-31school_holiday
state_holidays = pd.DataFrame({'ds': pd.to_datetime(state_holidays), 
                                'holiday': 'state_holiday'})
state_holidays
dsholiday
02013-01-01state_holiday
12013-01-06state_holiday
22013-03-29state_holiday
32013-04-01state_holiday
42013-05-01state_holiday
52013-05-09state_holiday
62013-05-20state_holiday
72013-05-30state_holiday
82013-08-15state_holiday
92013-10-03state_holiday
102013-10-31state_holiday
112013-11-01state_holiday
122013-12-25state_holiday
132013-12-26state_holiday
142014-01-01state_holiday
152014-01-06state_holiday
162014-04-18state_holiday
172014-04-21state_holiday
182014-05-01state_holiday
192014-05-29state_holiday
202014-06-09state_holiday
212014-06-19state_holiday
222014-10-03state_holiday
232014-10-31state_holiday
242014-11-01state_holiday
252014-12-25state_holiday
262014-12-26state_holiday
272015-01-01state_holiday
282015-01-06state_holiday
292015-04-03state_holiday
302015-04-06state_holiday
312015-05-01state_holiday
322015-05-14state_holiday
332015-05-25state_holiday
342015-06-04state_holiday

– We concatenate school vacations and State holidays:

school_state_holidays = pd.concat((state_holidays, school_holidays), axis = 0)
school_state_holidays
dsholiday
02013-01-01state_holiday
12013-01-06state_holiday
22013-03-29state_holiday
32013-04-01state_holiday
42013-05-01state_holiday
.........
4722015-07-27school_holiday
4732015-07-28school_holiday
4742015-07-29school_holiday
4752015-07-30school_holiday
4762015-07-31school_holiday

– Let’s make predictions using holidays for a specific store:

sales_predictions(6, inner_df, school_state_holidays, 90)

If we look at the sales, we can observer a downward trend according Facebook’s algorithm’s prediction.