Predicting Avocado Prices

We are going to predict avocado prices and therefore we will use Facebook Prophet tool.

The dataset represents weekly 2018 retail scan data for National retail volume (units) and price. Retail scan data comes directly from retailers’ cash registers based on actual retail sales of Hass avocados. Starting in 2013, the dataset reflects an expanded, multi-outlet retail data set. Multi-outlet reporting includes an aggregation of the following channels: grocery, mass, club, drug, dollar and military. The Average Price (of avocados) reflects a per unit (per avocado) cost, even when multiple units (avocados) are sold in bags. The Product Lookup codes (PLU’s) in dataset are only for Hass avocados. Other varieties of avocados (e.g. greenskins) are not included in this data.

Some relevant columns in the dataset:

  • Date: The date of the observation.
  • AveragePrice: The average price of a single avocado.
  • type: Conventional or organic.
  • year: The year.
  • Region: The city or region of the observation.
  • Total Volume: Total number of avocados sold.
  • 4046: Total number of avocados with PLU 4046 sold.
  • 4225: Total number of avocados with PLU 4225 sold.
  • 4770: Total number of avocados with PLU 4770 sold.

Data Source: https://www.kaggle.com/neuromusic/avocado-prices

Prophet

Prophet is open source software released by Facebook’s Core Data Science team.

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.

In this link you have more information about Prophet with Python:

1 – Import libraries and data exploration

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import random
import seaborn as sns
from fbprophet import Prophet
df = pd.read_csv('avocado.csv')
df.head()
Unnamed: 0DateAveragePriceTotal Volume404642254770Total BagsSmall BagsLarge BagsXLarge Bagstypeyearregion
002015-12-271.3364236.621036.7454454.8548.168696.878603.6293.250.0conventional2015Albany
112015-12-201.3554876.98674.2844638.8158.339505.569408.0797.490.0conventional2015Albany
222015-12-130.93118220.22794.70109149.67130.508145.358042.21103.140.0conventional2015Albany
332015-12-061.0878992.151132.0071976.4172.585811.165677.40133.760.0conventional2015Albany
442015-11-291.2851039.60941.4843838.3975.786183.955986.26197.690.0conventional2015Albany
df = df.sort_values("Date")
df.info() 
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 18249 entries, 11569 to 8814
    Data columns (total 14 columns):
     #   Column        Non-Null Count  Dtype  
    ---  ------        --------------  -----  
     0   Unnamed: 0    18249 non-null  int64  
     1   Date          18249 non-null  object 
     2   AveragePrice  18249 non-null  float64
     3   Total Volume  18249 non-null  float64
     4   4046          18249 non-null  float64
     5   4225          18249 non-null  float64
     6   4770          18249 non-null  float64
     7   Total Bags    18249 non-null  float64
     8   Small Bags    18249 non-null  float64
     9   Large Bags    18249 non-null  float64
     10  XLarge Bags   18249 non-null  float64
     11  type          18249 non-null  object 
     12  year          18249 non-null  int64  
     13  region        18249 non-null  object 
    dtypes: float64(9), int64(2), object(3)
    memory usage: 2.1+ MB

Missing values

# Let's see how many null elements are contained in the data
total = df.isnull().sum().sort_values(ascending=False) 
# missing values percentage
percent = ((df.isnull().sum())*100)/df.isnull().count().sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total','Percent'], sort=False).sort_values('Total', ascending=False)
missing_data.head(40)
TotalPercent
Unnamed: 000.0
Date00.0
AveragePrice00.0
Total Volume00.0
404600.0
422500.0
477000.0
Total Bags00.0
Small Bags00.0
Large Bags00.0
XLarge Bags00.0
type00.0
year00.0
region00.0

Price trend during the year

plt.figure(figsize=(10,10))
plt.plot(df['Date'], df['AveragePrice'])
  • We see that the price of the avocado rises when it’s September.

Regions

df['region'].value_counts()
    Jacksonville           338
    Tampa                  338
    BuffaloRochester       338
    Portland               338
    SanDiego               338
    NorthernNewEngland     338
    HarrisburgScranton     338
    SouthCentral           338
    PhoenixTucson          338
    RaleighGreensboro      338
    Indianapolis           338
    Plains                 338
    Orlando                338
    Houston                338
    SouthCarolina          338
    West                   338
    Midsouth               338
    CincinnatiDayton       338
    LasVegas               338
    Boston                 338
    Charlotte              338
    Albany                 338
    Nashville              338
    Southeast              338
    Columbus               338
    Philadelphia           338
    Chicago                338
    Louisville             338
    GrandRapids            338
    Atlanta                338
    BaltimoreWashington    338
    Roanoke                338
    Denver                 338
    NewYork                338
    Pittsburgh             338
    TotalUS                338
    Syracuse               338
    Spokane                338
    HartfordSpringfield    338
    RichmondNorfolk        338
    Boise                  338
    DallasFtWorth          338
    Sacramento             338
    California             338
    SanFrancisco           338
    Detroit                338
    GreatLakes             338
    StLouis                338
    MiamiFtLauderdale      338
    Northeast              338
    NewOrleansMobile       338
    Seattle                338
    LosAngeles             338
    WestTexNewMexico       335
    Name: region, dtype: int64

Year

plt.figure(figsize=[15,5])
sns.countplot(x = 'year', data = df)
plt.xticks(rotation = 45)
  • We see less sales in 2018 because the data we have goes up to the beginning of that year.

2 – Data Preparation

df_prophet = df[['Date', 'AveragePrice']] 
df_prophet.tail()
DateAveragePrice
85742018-03-251.36
90182018-03-250.70
181412018-03-251.42
176732018-03-251.70
88142018-03-251.34

3 – Predictions with Prophet

df_prophet = df_prophet.rename(columns={'Date':'ds', 'AveragePrice':'y'})
df_prophet.head()
dsy
115692015-01-041.75
95932015-01-041.49
100092015-01-041.68
18192015-01-041.52
93332015-01-041.64
m = Prophet()
m.fit(df_prophet)
# Forcasting into the future
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
forecast
dstrendyhat_loweryhat_uppertrend_lowertrend_upperadditive_termsadditive_terms_loweradditive_terms_upperyearlyyearly_loweryearly_uppermultiplicative_termsmultiplicative_terms_lowermultiplicative_terms_upperyhat
02015-01-041.4999030.8677481.8823901.4999031.499903-0.115033-0.115033-0.115033-0.115033-0.115033-0.1150330.00.00.01.384871
12015-01-111.4946430.9112501.8751321.4946431.494643-0.106622-0.106622-0.106622-0.106622-0.106622-0.1066220.00.00.01.388021
22015-01-181.4893820.8731521.8708241.4893821.489382-0.106249-0.106249-0.106249-0.106249-0.106249-0.1062490.00.00.01.383133
32015-01-251.4841210.8639671.8402221.4841211.484121-0.125093-0.125093-0.125093-0.125093-0.125093-0.1250930.00.00.01.359028
42015-02-011.4788600.8270671.8181521.4788601.478860-0.153293-0.153293-0.153293-0.153293-0.153293-0.1532930.00.00.01.325567
...................................................
5292019-03-211.1665070.5676661.6364240.9651411.362886-0.086285-0.086285-0.086285-0.086285-0.086285-0.0862850.00.00.01.080222
5302019-03-221.1657840.5507701.6207990.9638621.363032-0.084558-0.084558-0.084558-0.084558-0.084558-0.0845580.00.00.01.081225
5312019-03-231.1650600.5619311.6208420.9626201.363179-0.082555-0.082555-0.082555-0.082555-0.082555-0.0825550.00.00.01.082505
5322019-03-241.1643370.5814301.6532470.9613781.363220-0.080296-0.080296-0.080296-0.080296-0.080296-0.0802960.00.00.01.084041
5332019-03-251.1636130.5307271.6059000.9598341.363324-0.077808-0.077808-0.077808-0.077808-0.077808-0.0778080.00.00.01.085805
figure = m.plot(forecast, xlabel='Date', ylabel='Price')
figure2 = m.plot_components(forecast)

4 – Nashville data analysis

df_nashville = df[df['region']=='Nashville']
df_nashville
Unnamed: 0DateAveragePriceTotal Volume404642254770Total BagsSmall BagsLarge BagsXLarge Bagstypeyearregion
1403512015-01-041.00162162.75113865.8311083.5811699.0325514.3119681.135611.51221.67conventional2015Nashville
10529512015-01-041.843966.00244.342700.0276.21945.43838.34107.090.00organic2015Nashville
10528502015-01-111.922892.29204.752168.3380.56438.65435.543.110.00organic2015Nashville
1402502015-01-111.07149832.20103822.609098.8611665.7825244.9622478.922766.040.00conventional2015Nashville
1401492015-01-181.08143464.6497216.478423.5712187.7225636.8823520.542116.340.00conventional2015Nashville
.............................................
1791522018-03-111.3210160.9638.322553.360.007569.285132.052437.230.00organic2018Nashville
879112018-03-180.89316201.23141265.4011914.02387.61162634.20131128.6429834.211671.35conventional2018Nashville
1791412018-03-181.2710422.0520.412115.890.008285.754797.983487.770.00organic2018Nashville
879002018-03-250.95306280.52125788.5410713.80334.61169443.57136737.4430406.072300.06conventional2018Nashville
1791302018-03-251.487250.6943.771759.470.005447.454834.97612.480.00organic2018Nashville
df_nashville = df_nashville.sort_values("Date")
df_nashville
Unnamed: 0DateAveragePriceTotal Volume404642254770Total BagsSmall BagsLarge BagsXLarge Bagstypeyearregion
1403512015-01-041.00162162.75113865.8311083.5811699.0325514.3119681.135611.51221.67conventional2015Nashville
10529512015-01-041.843966.00244.342700.0276.21945.43838.34107.090.00organic2015Nashville
10528502015-01-111.922892.29204.752168.3380.56438.65435.543.110.00organic2015Nashville
1402502015-01-111.07149832.20103822.609098.8611665.7825244.9622478.922766.040.00conventional2015Nashville
1401492015-01-181.08143464.6497216.478423.5712187.7225636.8823520.542116.340.00conventional2015Nashville
.............................................
1791522018-03-111.3210160.9638.322553.360.007569.285132.052437.230.00organic2018Nashville
879112018-03-180.89316201.23141265.4011914.02387.61162634.20131128.6429834.211671.35conventional2018Nashville
1791412018-03-181.2710422.0520.412115.890.008285.754797.983487.770.00organic2018Nashville
879002018-03-250.95306280.52125788.5410713.80334.61169443.57136737.4430406.072300.06conventional2018Nashville
1791302018-03-251.487250.6943.771759.470.005447.454834.97612.480.00organic2018Nashville
df_nashville = df_nashville.rename(columns={'Date':'ds', 'AveragePrice':'y'})
m = Prophet()
m.fit(df_nashville)
# Forcasting into the future
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
fig = m.plot(forecast, xlabel='Date', ylabel='Price')
fig2 = m.plot_components(forecast)