Chicago Crime Prediction

We are going to predict the crime rate in Chicago with Facebook Prophet.

Our dataset contains a summary of the reported crimes that occurred in the City of Chicago from 2001 to 2017, with the following columns:

  • ID: Unique identifier for the record.
  • Case Number: The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
  • Date: Date when the incident occurred.
  • Block: Address where the incident occurred.
  • IUCR: The Illinois Uniform Crime Reporting code.
  • Primary Type: The primary description of the IUCR code.
  • Description: The secondary description of the IUCR code, a subcategory of the primary description.
  • Location Description: Description of the location where the incident occurred.
  • Arrest: Indicates whether an arrest was made.
  • Domestic: Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
  • Beat: Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car.
  • District: Indicates the police district where the incident occurred.
  • Ward: The ward (City Council district) where the incident occurred.
  • Community Area: Indicates the community area where the incident occurred. Chicago has 77 community areas.
  • FBI Code: Indicates the crime classification as outlined in the FBI’s National Incident-Based Reporting System (NIBRS).
  • X Coordinate: The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Y Coordinate: The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection.
  • Year: Year the incident occurred.
  • Updated On: Date and time the record was last updated.
  • Latitude: The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
  • Longitude: The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
  • Location: The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.

Data source: https://www.kaggle.com/currie32/crimes-in-chicago

Prophet

Prophet is open source software released by Facebook’s Core Data Science team.

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data.
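
To make the idea of an additive model concrete, the sketch below builds a synthetic monthly series as trend plus yearly seasonality plus noise. It only illustrates the kind of structure Prophet looks for; it is not Prophet's fitting procedure, and every number in it (the slope, the amplitude, the noise level) is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

months = np.arange(48)                             # four years of monthly points
trend = 1000.0 - 5.0 * months                      # slowly decreasing level (made-up slope)
yearly = 100.0 * np.sin(2 * np.pi * months / 12)   # pattern that repeats every 12 months
noise = rng.normal(0.0, 10.0, size=months.size)    # random fluctuation

# Additive model: the observed value is the sum of the components
y = trend + yearly + noise
```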

In this link you have more information about Prophet with Python:

1 – Import libraries and dataset

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import random
import seaborn as sns
from prophet import Prophet  # the package was named fbprophet before version 1.0
# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0
df_1 = pd.read_csv('Chicago_Crimes_2001_to_2004.csv', on_bad_lines='skip')
df_2 = pd.read_csv('Chicago_Crimes_2005_to_2007.csv', on_bad_lines='skip')
df_3 = pd.read_csv('Chicago_Crimes_2008_to_2011.csv', on_bad_lines='skip')
df_4 = pd.read_csv('Chicago_Crimes_2012_to_2017.csv', on_bad_lines='skip')
# Concatenating all datasets
df = pd.concat([df_1, df_2, df_3, df_4], ignore_index=False, axis=0)
df.head()
| | Unnamed: 0 | ID | Case Number | Date | Block | IUCR | Primary Type | Description | Location Description | Arrest | ... | Ward | Community Area | FBI Code | X Coordinate | Y Coordinate | Year | Updated On | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 879 | 4786321 | HM399414 | 01/01/2004 12:01:00 AM | 082XX S COLES AVE | 0840 | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | False | ... | 7.0 | 46.0 | 06 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 1 | 2544 | 4676906 | HM278933 | 03/01/2003 12:00:00 AM | 004XX W 42ND PL | 2825 | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | ... | 11.0 | 61.0 | 26 | 1173974.0 | 1876757.0 | 2003.0 | 04/15/2016 08:55:02 AM | 41.817229 | -87.637328 | (41.817229156, -87.637328162) |
| 2 | 2919 | 4789749 | HM402220 | 06/20/2004 11:00:00 AM | 025XX N KIMBALL AVE | 1752 | OFFENSE INVOLVING CHILDREN | AGG CRIM SEX ABUSE FAM MEMBER | RESIDENCE | False | ... | 35.0 | 22.0 | 20 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 3 | 2927 | 4789765 | HM402058 | 12/30/2004 08:00:00 PM | 045XX W MONTANA ST | 0840 | THEFT | FINANCIAL ID THEFT: OVER $300 | OTHER | False | ... | 31.0 | 20.0 | 06 | NaN | NaN | 2004.0 | 08/17/2015 03:03:40 PM | NaN | NaN | NaN |
| 4 | 3302 | 4677901 | HM275615 | 05/01/2003 01:00:00 AM | 111XX S NORMAL AVE | 0841 | THEFT | FINANCIAL ID THEFT:$300 &UNDER | RESIDENCE | False | ... | 34.0 | 49.0 | 06 | 1174948.0 | 1831051.0 | 2003.0 | 04/15/2016 08:55:02 AM | 41.691785 | -87.635116 | (41.691784636, -87.635115968) |

2 – Missing values

# Let's see how many null elements each column contains
total = df.isnull().sum().sort_values(ascending=False)
# Percentage of missing values per column
percent = df.isnull().mean() * 100
missing_data = pd.concat([total, percent], axis=1, keys=['Total','Percent'], sort=False).sort_values('Total', ascending=False)
missing_data.head(40)
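| | Total | Percent |
|---|---|---|
| Community Area | 702091 | 8.841028 |
| Ward | 700224 | 8.817518 |
| Location | 105574 | 1.329433 |
| Longitude | 105574 | 1.329433 |
| Latitude | 105573 | 1.329420 |
| Y Coordinate | 105573 | 1.329420 |
| X Coordinate | 105573 | 1.329420 |
| Location Description | 1990 | 0.025059 |
| District | 91 | 0.001146 |
| Case Number | 7 | 0.000088 |
| Arrest | 0 | 0.000000 |
| Date | 0 | 0.000000 |
| Block | 0 | 0.000000 |
| IUCR | 0 | 0.000000 |
| Primary Type | 0 | 0.000000 |
| Description | 0 | 0.000000 |
| Year | 0 | 0.000000 |
| Domestic | 0 | 0.000000 |
| ID | 0 | 0.000000 |
| Unnamed: 0 | 0 | 0.000000 |
| FBI Code | 0 | 0.000000 |
| Updated On | 0 | 0.000000 |
| Beat | 0 | 0.000000 |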
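
The same summary can be checked by hand on a tiny example. The sketch below uses a made-up three-column frame (the values are not from the crime data) to show how the count and the percentage of missing cells per column are obtained.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the crime data (values are made up)
toy = pd.DataFrame({
    "Ward": [7.0, np.nan, 35.0, np.nan],
    "District": [4.0, 9.0, np.nan, 25.0],
    "Arrest": [False, True, False, False],
})

summary = pd.DataFrame({
    "Total": toy.isnull().sum(),            # number of missing cells per column
    "Percent": toy.isnull().mean() * 100,   # share of missing cells per column
}).sort_values("Total", ascending=False)

print(summary)
```

Here Ward has 2 missing values out of 4 rows (50%), District has 1 (25%) and Arrest has none.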

– Dropping the columns we don't need:

df.drop(['Unnamed: 0', 'Case Number', 'IUCR', 'X Coordinate', 'Y Coordinate','Updated On','Year', 'FBI Code', 'Beat','Ward','Community Area', 'Location', 'District', 'Latitude' , 'Longitude'], inplace=True, axis=1)
df
| | ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic |
|---|---|---|---|---|---|---|---|---|
| 0 | 4786321 | 01/01/2004 12:01:00 AM | 082XX S COLES AVE | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | False | False |
| 1 | 4676906 | 03/01/2003 12:00:00 AM | 004XX W 42ND PL | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | True |
| 2 | 4789749 | 06/20/2004 11:00:00 AM | 025XX N KIMBALL AVE | OFFENSE INVOLVING CHILDREN | AGG CRIM SEX ABUSE FAM MEMBER | RESIDENCE | False | False |
| 3 | 4789765 | 12/30/2004 08:00:00 PM | 045XX W MONTANA ST | THEFT | FINANCIAL ID THEFT: OVER $300 | OTHER | False | False |
| 4 | 4677901 | 05/01/2003 01:00:00 AM | 111XX S NORMAL AVE | THEFT | FINANCIAL ID THEFT:$300 &UNDER | RESIDENCE | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1456709 | 10508679 | 05/03/2016 11:33:00 PM | 026XX W 23RD PL | BATTERY | DOMESTIC BATTERY SIMPLE | APARTMENT | True | True |
| 1456710 | 10508680 | 05/03/2016 11:30:00 PM | 073XX S HARVARD AVE | CRIMINAL DAMAGE | TO PROPERTY | APARTMENT | True | True |
| 1456711 | 10508681 | 05/03/2016 12:15:00 AM | 024XX W 63RD ST | BATTERY | AGGRAVATED: HANDGUN | SIDEWALK | False | False |
| 1456712 | 10508690 | 05/03/2016 09:07:00 PM | 082XX S EXCHANGE AVE | BATTERY | DOMESTIC BATTERY SIMPLE | SIDEWALK | False | True |
| 1456713 | 10508692 | 05/03/2016 11:38:00 PM | 001XX E 75TH ST | OTHER OFFENSE | OTHER WEAPONS VIOLATION | PARKING LOT/GARAGE(NON.RESID.) | True | False |

– Converting the Date column to datetime:

df.Date = pd.to_datetime(df.Date, format='%m/%d/%Y %I:%M:%S %p')
df.head()
| | ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic |
|---|---|---|---|---|---|---|---|---|
| 0 | 4786321 | 2004-01-01 00:01:00 | 082XX S COLES AVE | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | False | False |
| 1 | 4676906 | 2003-03-01 00:00:00 | 004XX W 42ND PL | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | True |
| 2 | 4789749 | 2004-06-20 11:00:00 | 025XX N KIMBALL AVE | OFFENSE INVOLVING CHILDREN | AGG CRIM SEX ABUSE FAM MEMBER | RESIDENCE | False | False |
| 3 | 4789765 | 2004-12-30 20:00:00 | 045XX W MONTANA ST | THEFT | FINANCIAL ID THEFT: OVER $300 | OTHER | False | False |
| 4 | 4677901 | 2003-05-01 01:00:00 | 111XX S NORMAL AVE | THEFT | FINANCIAL ID THEFT:$300 &UNDER | RESIDENCE | False | False |
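
The format string is what turns the 12-hour timestamps into 24-hour datetimes. As a minimal sketch, here are two of the raw values shown earlier parsed on their own; the variable names are only illustrative.

```python
import pandas as pd

raw = pd.Series(["12/30/2004 08:00:00 PM", "01/01/2004 12:01:00 AM"])

# %I is the hour on a 12-hour clock and %p is the AM/PM marker
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")

print(parsed.dt.hour.tolist())  # [20, 0]
```

Passing an explicit format also avoids having pandas guess the layout of each string, which tends to be slower on a dataset of this size.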

– Change the index to the date:

df.index = pd.DatetimeIndex(df.Date)
df.head()
| Date (index) | ID | Date | Block | Primary Type | Description | Location Description | Arrest | Domestic |
|---|---|---|---|---|---|---|---|---|
| 2004-01-01 00:01:00 | 4786321 | 2004-01-01 00:01:00 | 082XX S COLES AVE | THEFT | FINANCIAL ID THEFT: OVER $300 | RESIDENCE | False | False |
| 2003-03-01 00:00:00 | 4676906 | 2003-03-01 00:00:00 | 004XX W 42ND PL | OTHER OFFENSE | HARASSMENT BY TELEPHONE | RESIDENCE | False | True |
| 2004-06-20 11:00:00 | 4789749 | 2004-06-20 11:00:00 | 025XX N KIMBALL AVE | OFFENSE INVOLVING CHILDREN | AGG CRIM SEX ABUSE FAM MEMBER | RESIDENCE | False | False |
| 2004-12-30 20:00:00 | 4789765 | 2004-12-30 20:00:00 | 045XX W MONTANA ST | THEFT | FINANCIAL ID THEFT: OVER $300 | OTHER | False | False |
| 2003-05-01 01:00:00 | 4677901 | 2003-05-01 01:00:00 | 111XX S NORMAL AVE | THEFT | FINANCIAL ID THEFT:$300 &UNDER | RESIDENCE | False | False |

Primary Type visualization

df['Primary Type'].value_counts().iloc[:15]
    THEFT                         1640506
    BATTERY                       1442716
    CRIMINAL DAMAGE                923000
    NARCOTICS                      885431
    OTHER OFFENSE                  491922
    ASSAULT                        481661
    BURGLARY                       470958
    MOTOR VEHICLE THEFT            370548
    ROBBERY                        300453
    DECEPTIVE PRACTICE             280931
    CRIMINAL TRESPASS              229366
    PROSTITUTION                    86401
    WEAPONS VIOLATION               77429
    PUBLIC PEACE VIOLATION          58548
    OFFENSE INVOLVING CHILDREN      51441
    Name: Primary Type, dtype: int64
df['Primary Type'].value_counts().iloc[:15].index
    Index(['THEFT', 'BATTERY', 'CRIMINAL DAMAGE', 'NARCOTICS', 'OTHER OFFENSE',
           'ASSAULT', 'BURGLARY', 'MOTOR VEHICLE THEFT', 'ROBBERY',
           'DECEPTIVE PRACTICE', 'CRIMINAL TRESPASS', 'PROSTITUTION',
           'WEAPONS VIOLATION', 'PUBLIC PEACE VIOLATION',
           'OFFENSE INVOLVING CHILDREN'],
          dtype='object')
plt.figure(figsize = (15, 10))
sns.countplot(y= 'Primary Type', data = df, order = df['Primary Type'].value_counts().iloc[:15].index)

Location Description visualization

plt.figure(figsize = (15, 10))
sns.countplot(y= 'Location Description', data = df, order = df['Location Description'].value_counts().iloc[:15].index)

3 – Data resample

Resample is a convenience method for frequency conversion and resampling of time series. Here we use it to count the reported crimes in each period.

More info here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.resample.html
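
As a small illustration of what resample does with a datetime index, the sketch below counts three made-up incidents per month. It passes the MonthEnd offset object rather than a string alias, purely as an assumption to keep the example independent of the pandas version.

```python
import pandas as pd

# Three made-up incidents: two in January, one in February
events = pd.DataFrame(
    {"Primary Type": ["THEFT", "BATTERY", "THEFT"]},
    index=pd.DatetimeIndex(["2001-01-05", "2001-01-20", "2001-02-10"], name="Date"),
)

# Each bin is labelled with the last day of its month; size() counts the rows in it
per_month = events.resample(pd.offsets.MonthEnd()).size()

print(per_month)
```

The result has one row per month, labelled 2001-01-31 (2 incidents) and 2001-02-28 (1 incident).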

– Per year:

df.resample('Y').size()
    Date
    2001-12-31    568518
    2002-12-31    490879
    2003-12-31    475913
    2004-12-31    388205
    2005-12-31    455811
    2006-12-31    794684
    2007-12-31    621848
    2008-12-31    852053
    2009-12-31    783900
    2010-12-31    700691
    2011-12-31    352066
    2012-12-31    335670
    2013-12-31    306703
    2014-12-31    274527
    2015-12-31    262995
    2016-12-31    265462
    2017-12-31     11357
    Freq: A-DEC, dtype: int64
plt.plot(df.resample('Y').size())
plt.title('Crimes Count Per Year')
plt.xlabel('Years')
plt.ylabel('Number of Crimes')

– Per month:

df.resample('M').size()
    Date
    2001-01-31    74995
    2001-02-28    66288
    2001-03-31    53122
    2001-04-30    40166
    2001-05-31    41876
                  ...  
    2016-09-30    23235
    2016-10-31    23314
    2016-11-30    21140
    2016-12-31    19580
    2017-01-31    11357
    Freq: M, Length: 193, dtype: int64
plt.plot(df.resample('M').size())
plt.title('Crimes Count Per Month')
plt.xlabel('Months')
plt.ylabel('Number of Crimes')

5 – Data Preparation

df_prophet = df.resample('M').size().reset_index()
df_prophet
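| | Date | 0 |
|---|---|---|
| 0 | 2001-01-31 | 74995 |
| 1 | 2001-02-28 | 66288 |
| 2 | 2001-03-31 | 53122 |
| 3 | 2001-04-30 | 40166 |
| 4 | 2001-05-31 | 41876 |
| ... | ... | ... |
| 188 | 2016-09-30 | 23235 |
| 189 | 2016-10-31 | 23314 |
| 190 | 2016-11-30 | 21140 |
| 191 | 2016-12-31 | 19580 |
| 192 | 2017-01-31 | 11357 |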
df_prophet.columns = ['Date', 'Crime Count']
df_prophet.head()
| | Date | Crime Count |
|---|---|---|
| 0 | 2001-01-31 | 74995 |
| 1 | 2001-02-28 | 66288 |
| 2 | 2001-03-31 | 53122 |
| 3 | 2001-04-30 | 40166 |
| 4 | 2001-05-31 | 41876 |
df_prophet = pd.DataFrame(df_prophet)
df_prophet
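| | Date | Crime Count |
|---|---|---|
| 0 | 2001-01-31 | 74995 |
| 1 | 2001-02-28 | 66288 |
| 2 | 2001-03-31 | 53122 |
| 3 | 2001-04-30 | 40166 |
| 4 | 2001-05-31 | 41876 |
| ... | ... | ... |
| 188 | 2016-09-30 | 23235 |
| 189 | 2016-10-31 | 23314 |
| 190 | 2016-11-30 | 21140 |
| 191 | 2016-12-31 | 19580 |
| 192 | 2017-01-31 | 11357 |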

6 – Predictions with Prophet

df_prophet.columns
    Index(['Date', 'Crime Count'], dtype='object')
df_prophet_final = df_prophet.rename(columns={'Date':'ds', 'Crime Count':'y'})
df_prophet_final.head()
| | ds | y |
|---|---|---|
| 0 | 2001-01-31 | 74995 |
| 1 | 2001-02-28 | 66288 |
| 2 | 2001-03-31 | 53122 |
| 3 | 2001-04-30 | 40166 |
| 4 | 2001-05-31 | 41876 |
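
Prophet expects exactly this layout: a ds column with the timestamps and a y column with the values. The resampling and renaming steps can also be written as a single chain, sketched here on made-up incidents; the variable names are hypothetical and the MonthEnd offset object is used only to keep the example independent of the pandas version.

```python
import pandas as pd

# Made-up incident timestamps standing in for the crime records
incidents = pd.DataFrame(
    {"ID": [1, 2, 3]},
    index=pd.DatetimeIndex(["2001-01-05", "2001-01-20", "2001-02-10"], name="Date"),
)

prophet_input = (
    incidents.resample(pd.offsets.MonthEnd())
    .size()                   # monthly counts, indexed by Date
    .rename_axis("ds")        # name the timestamp index ds
    .reset_index(name="y")    # move it into a column and call the counts y
)

print(prophet_input)
```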
m = Prophet()
m.fit(df_prophet_final)
# Forecasting 365 periods into the future (make_future_dataframe defaults to daily frequency)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)

forecast
| | ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001-01-31 | 40559.731540 | 23628.244005 | 54258.118231 | 40559.731540 | 40559.731540 | -1574.911817 | -1574.911817 | -1574.911817 | -1574.911817 | -1574.911817 | -1574.911817 | 0.0 | 0.0 | 0.0 | 38984.819723 |
| 1 | 2001-02-28 | 40707.006266 | 18547.567086 | 49988.578194 | 40707.006266 | 40707.006266 | -6454.662746 | -6454.662746 | -6454.662746 | -6454.662746 | -6454.662746 | -6454.662746 | 0.0 | 0.0 | 0.0 | 34252.343520 |
| 2 | 2001-03-31 | 40870.060427 | 21495.169543 | 55202.417261 | 40870.060427 | 40870.060427 | -2068.155039 | -2068.155039 | -2068.155039 | -2068.155039 | -2068.155039 | -2068.155039 | 0.0 | 0.0 | 0.0 | 38801.905388 |
| 3 | 2001-04-30 | 41027.854775 | 23281.051427 | 53789.296148 | 41027.854775 | 41027.854775 | -1473.221009 | -1473.221009 | -1473.221009 | -1473.221009 | -1473.221009 | -1473.221009 | 0.0 | 0.0 | 0.0 | 39554.633766 |
| 4 | 2001-05-31 | 41190.908936 | 29675.632094 | 61543.960585 | 41190.908936 | 41190.908936 | 3883.204571 | 3883.204571 | 3883.204571 | 3883.204571 | 3883.204571 | 3883.204571 | 0.0 | 0.0 | 0.0 | 45074.113507 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 553 | 2018-01-27 | 9114.477200 | -11154.856735 | 21864.404720 | 8834.763543 | 9395.155285 | -3423.766249 | -3423.766249 | -3423.766249 | -3423.766249 | -3423.766249 | -3423.766249 | 0.0 | 0.0 | 0.0 | 5690.710951 |
| 554 | 2018-01-28 | 9100.488430 | -8934.453537 | 23419.424189 | 8819.513188 | 9382.545758 | -2770.269732 | -2770.269732 | -2770.269732 | -2770.269732 | -2770.269732 | -2770.269732 | 0.0 | 0.0 | 0.0 | 6330.218698 |
| 555 | 2018-01-29 | 9086.499659 | -8341.707573 | 22854.628784 | 8804.262834 | 9369.895953 | -2232.904204 | -2232.904204 | -2232.904204 | -2232.904204 | -2232.904204 | -2232.904204 | 0.0 | 0.0 | 0.0 | 6853.595455 |
| 556 | 2018-01-30 | 9072.510889 | -8308.241536 | 22550.218800 | 8789.291815 | 9357.246149 | -1837.148931 | -1837.148931 | -1837.148931 | -1837.148931 | -1837.148931 | -1837.148931 | 0.0 | 0.0 | 0.0 | 7235.361958 |
| 557 | 2018-01-31 | 9058.522118 | -9068.314852 | 22729.458262 | 8773.599845 | 9344.596345 | -1605.129458 | -1605.129458 | -1605.129458 | -1605.129458 | -1605.129458 | -1605.129458 | 0.0 | 0.0 | 0.0 | 7453.392660 |
figure = m.plot(forecast, xlabel='Date', ylabel='Crime Rate')
figure2 = m.plot_components(forecast)
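
One detail worth checking: the history is monthly, but make_future_dataframe uses a daily frequency unless told otherwise, which is why the forecast table goes from month-end rows to one row per day. Prophet's documentation on non-daily data suggests passing a freq argument that matches the history. The pandas-only sketch below compares the two sets of future dates; the variable names are illustrative and no Prophet call is made.

```python
import pandas as pd

last_observed = pd.Timestamp("2017-01-31")  # last date in the monthly history

# Daily steps, as produced by periods=365 with the default frequency
daily_future = pd.date_range(start=last_observed, periods=366, freq="D")[1:]

# Month-end steps, matching the monthly resampled history
monthly_future = pd.date_range(
    start=last_observed, periods=13, freq=pd.offsets.MonthEnd()
)[1:]

print(len(daily_future), daily_future[-1].date())
print(len(monthly_future), monthly_future[0].date(), monthly_future[-1].date())
```

Both ranges end on 2018-01-31, but the monthly one has 12 rows instead of 365. With a fitted model, the monthly version would look something like m.make_future_dataframe(periods=12, freq='M'), where the alias for month-end depends on the pandas version ('ME' in recent releases).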

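The forecast follows the long-term downward trend and the yearly seasonality of the historical data, so Prophet gives us a rough estimate of how crime in Chicago may evolve over the coming year. The wide gap between yhat_lower and yhat_upper shows that these predictions should be read with some caution.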