Bike-sharing company

[This work is based on this course: Google Data Analytics Professional Certificate.]

Cyclistic is a fictional bike-share company located in Chicago. Since its foundation in 2016, Cyclistic has since grown to a fleet of 5,824 bicycles. This bicycles are geotracked and locked into a network of 692 stations across Chicago. Every one of them can be unlocked from one station and returned to any other station anytime. Cyclistic has 3 flexible pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

The company wants to increase revenue with a new marketing strategy to convert casual riders into annual members. Therefore I am tasked with analysing bike usage data to understand how casual riders and annual members use Cyclistic bikes differently, to help create a data driven marketing strategy.

In particular, I am interested in the following:

  1. How many members and casuals customers do we have? What are their proportions of total trips?
  2. Calculate the average ride lengths for members and casuals respectively.
  3. Calculate the most common starting and ending stations for casuals and members respectively.
  4. Are there something relevant on any days of the week by category?
  5. Can we see any preference for the rideable type for casuals and members?

The data used for this analysis was collected by Cyclistic, and pertains to rider patterns over the past twelve months from September 2020 to August 2021.

The data has been obtained from the following source. The data has been made available by Motivate International Inc. under this license.) This is public data that we can use to explore how different customer types are using Cyclistic bikes.

Features of the dataset:

  • ride_id: Unique ID per ride

  • rideable_type: Type of bicycle used

  • started_at: Date and time that the bicycle was checked out

  • ended_at: The date and time that the bicycle was checked in

  • start_station_name: Name of the station at the start of the trip

  • start_station_id: Unique identifier for the start station

  • end_station_name: Name of the station at the end of the trip

  • end_station_id: Unique identifier for the end station

  • start_lat: Latitude of the start station

  • start_lng: Longitude of the start station

  • end_lat: Latitude of the end station

  • end_lng: Longitude of the end station

  • member_casual: Member or a casual user

Fields we added during the analysis:

  • ride_length: Length of the ride calculated as ended_at – started_at.

  • day_of_week: Day of the week for started_at

For this case study, the datasets are appropriate and will enable to answer the business questions. The data has been made available by Motivate International Inc. under this license.) This is public data that we can use to explore how different customer types are using Cyclistic bikes. But note that data-privacy issues prohibit you from using riders’ personally identifiable information. This means that we won’t be able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area or if they have purchased multiple single passes.

Steps:

1- Excel:

Initially I’ve used Excel for cleaning and processing each .csv file. I’ve created a column called ‘ride_length’. Calculate the length of each ride by subtracting the column “started_at” from the column “ended_at” (for example, =D2-C2) and format as HH:MM:SS using Format > Cells > Time > 37:30:55.

I’ve also build a column called ‘day_of_week’ and calculate the day of the week that each ride started using the ‘WEEKDAY’ command (for example, =WEEKDAY(C2,1)) in each file. Format as General or as a number with no decimals, noting that 1 = Sunday and 7 = Saturday.

I have also created three pivot tables in each csv:

  • The first pivot table represent the average ride length by each type of member per month.
  • The second pivot table represent the total average ride length per month.
  • The third table shows the number of daily trips by day of week and month.
  • The fourth table the number of total daily trips by month.

2- Tableau:

I have used Tableau to graphically represent the 4 pilot tables created in the excel files. You can see them here, in the Tableau tab or in the following links:

2- RStudio:

I have used RStudio for almost all the manipulation. I have also used the sqldf library to make SQL queries and show them later. Such as maps with the Top 5 stations for the end and start of the journey, Total rides per day of week and total rides per rideable type.

1- Libraries

library("tidyverse")
library("lubridate")
library("rgdal")
library("markdown")
library("maps")
library("sqldf")
library("ggrepel")
library("sp")

2- Extraction and Manipulation dataframes:

-Reading and creating dataframes for each .csv:

setwd("/home/Documents/Google_course/tripdata_2020_09_2021_08_xml/")

sep_20 <- read.csv("202009-divvy-tripdata.csv", sep=",")
oct_20 <- read.csv("202010-divvy-tripdata.csv", sep=",")
nov_20 <- read.csv("202011-divvy-tripdata.csv", sep=",")
dec_20 <- read.csv("202012-divvy-tripdata.csv", sep=",")
jan_21 <- read.csv("202101-divvy-tripdata.csv", sep=",")
feb_21 <- read.csv("202102-divvy-tripdata.csv", sep=",")
mar_21 <- read.csv("202103-divvy-tripdata.csv", sep=",")
apr_21 <- read.csv("202104-divvy-tripdata.csv", sep=",")
may_21 <- read.csv("202105-divvy-tripdata.csv", sep=",")
jun_21 <- read.csv("202106-divvy-tripdata.csv", sep=",")
jul_21 <- read.csv("202107-divvy-tripdata.csv", sep=",")
aug_21 <- read.csv("202108-divvy-tripdata.csv", sep=",")

-Merging all dataframes:

df <- do.call("rbind", list(sep_20, oct_20, nov_20, dec_20, jan_21, feb_21, mar_21, apr_21, may_21, jun_21, jul_21, aug_21))

View(head(df))
ride_idrideable_typestarted_atended_atstart_station_namestart_station_idend_station_nameend_station_idstart_latstart_lngend_latend_lngmember_casualride_lengthday_of_weekride_length_min
1F436C1376E544CA4docked_bike2020-09-01 00:00:072020-09-01 00:37:30Michigan Ave & Lake St52Michigan Ave & Lake St5241.886024-87.62411741.886024-87.624117casual00:37:23337,3833333333333
24BDBF1871364E2C7docked_bike2020-09-01 00:00:192020-09-01 00:16:13Broadway & Wilson Ave293Broadway & Berwyn Ave29441.965221-87.65813941.978353-87.659753member00:15:53315,9
30392A0D6FB466576electric_bike2020-09-01 00:00:332020-09-01 00:04:43Wells St & Evergreen Ave291State St & Pearson St10641.9066971666667-87.635123833333341.8976816666667-87.6288231666667casual00:04:0934,16666666666667
48684AF4D002CB59Edocked_bike2020-09-01 00:00:412020-09-01 00:16:21Broadway & Wilson Ave293Broadway & Berwyn Ave29441.965221-87.65813941.978353-87.659753casual00:15:39315,6666666666667
58240E1D13E461357docked_bike2020-09-01 00:02:072020-09-01 00:15:53Shore Dr & 55th St247Ellis Ave & 60th St42641.795212-87.58071541.78509714636-87.6010727606member00:13:46313,7666666666667
6ED3F7E1B5B1CBA23electric_bike2020-09-01 00:02:312020-09-01 00:06:31Benson Ave & Church St596Dodge Ave & Church St60042.0482308333333-87.683534166666742.0483108333333-87.6982918333333member00:04:0034

-Format transformation of Datetime:

df <- df %>%
  mutate(started_at = as_datetime(df$started_at, format = "%d/%m/%Y %H:%M")) %>%
  mutate(ended_at = as_datetime(df$ended_at, format = "%d/%m/%Y %H:%M")) %>%
  mutate(ride_length = as.difftime(df$ride_length, format = "%H:%M:%S"))

-Calculate mean and max value of the ride_length feature:

mean_ride_length <- as.numeric(mean(df$ride_length))/60
cat("Average ride length from 2020-09 to 2021-08:",mean_ride_length,"minutes.")
max_ride_length <- as.numeric(max(df$ride_length))/3600
cat("The longest ride from 2020-09 to 2021-08:",max_ride_length,"hours.")
Average ride length from 2020-09 to 2021-08: 22 minutes.
The longest ride from 2020-09 to 2021-08: 24 hours.

2- SQL queries

Now we are going to use SQL queries in RStudio using sqldf library.

2.1- Create a query with day_of_week (total, members, casuals)

day_week_total <- sqldf("SELECT day_of_week, member_casual, COUNT(day_of_week) AS Total
                 FROM df
                 GROUP BY member_casual, day_of_week
                 ORDER BY day_of_week DESC", 
method = "auto")

2.2- Top 5 starting geolocations for members:

members_start_geo <- sqldf("SELECT member_casual, start_station_name AS Start_Station, 
                start_lat AS Start_Latitude,
                start_lng As Start_Longitude, COUNT(start_station_name) AS Num_Trips
                FROM df
                WHERE start_station_name IS NOT ''
                AND member_casual = 'member'
                GROUP BY start_station_name
                ORDER BY COUNT(start_station_name) DESC
                LIMIT 5", 
method = "auto")

2.3- Top 5 starting geolocations for casuals:

casuals_start_geo <- sqldf("SELECT member_casual, start_station_name AS Start_Station, 
                start_lat AS Start_Latitude, start_lng As Start_Longitude,
                COUNT(start_station_name) AS Num_Trips
                FROM df
                WHERE start_station_name IS NOT ''
                AND member_casual = 'casual'
                GROUP BY start_station_name
                ORDER BY COUNT(start_station_name) DESC
                LIMIT 5", 
method = "auto")

2.4- Merging members_start_geo + casuals_start_geo tables:

start_all_geo <- rbind(members_start_geo, casuals_start_geo)
View(start_all_geo)
member_casualStart_StationStart_LatitudeStart_LongitudeNum_Trips
1memberClark St & Elm St41.902973-87.6312824032
2memberWells St & Concord Ln41.9120941666667-87.634750521309
3memberKingsbury St & Kinzie St41.889176-87.63850520516
4memberWells St & Elm St41.903222-87.63432419374
5memberDearborn St & Erie St41.893992-87.62931818550
6casualStreeter Dr & Grand Ave41.892278-87.61204356879
7casualMillennium Park41.881031-87.62408431442
8casualMichigan Ave & Oak St41.90096-87.62377627898
9casualLake Shore Dr & Monroe St41.880958-87.61674327148
10casualTheater on the Lake41.926277-87.63083422061

– Casuals: Theater on the Lake, Michigan Ave. & Oak St., Streeter Dr. & Gradn Ave., Millennium Park and Lake Shore Dr. & Monroe St..

– Members: Wells St. & Concord Ln., Wells St. & Elm St., Park St. & Elm St., Dearborn St. & Erie St. and Kingsbury St. & Kinize St..

2.5- Top 5 ending geolocations for members:

members_end_geo <- sqldf("SELECT member_casual, end_station_name AS End_Station, 
                end_lat AS End_Latitude,
                end_lng As End_Longitude, COUNT(end_station_name) AS Num_Trips
                FROM df
                WHERE end_station_name IS NOT ''
                AND member_casual = 'member'
                GROUP BY end_station_name
                ORDER BY COUNT(end_station_name) DESC
                LIMIT 5", 
method = "auto")

2.6- Top 5 ending geolocations for casuals:

casuals_end_geo <- sqldf("SELECT member_casual, end_station_name AS End_Station, 
                end_lat AS End_Latitude, end_lng As End_Longitude,
                COUNT(end_station_name) AS Num_Trips
                FROM df
                WHERE end_station_name IS NOT ''
                AND member_casual = 'casual'
                GROUP BY end_station_name
                ORDER BY COUNT(end_station_name) DESC
                LIMIT 5", 
method = "auto")

2.7- Merging members_end_geo + casuals_end_geo tables:

end_all_geo <- rbind(members_end_geo, casuals_end_geo)
View(end_all_geo)
member_casualEnd_StationEnd_LatitudeEnd_LongitudeNum_Trips
1memberClark St & Elm St41.902973-87.6312824487
2memberWells St & Concord Ln41.912133-87.63465621865
3memberKingsbury St & Kinzie St41.88917683258-87.638505771820785
4memberWells St & Elm St41.903222-87.63432419566
5memberDearborn St & Erie St41.893992-87.62931819114
6casualStreeter Dr & Grand Ave41.892278-87.61204359046
7casualMillennium Park41.8810317-87.6240843232614
8casualMichigan Ave & Oak St41.90096039-87.6237766428979
9casualLake Shore Dr & Monroe St41.880958-87.61674325667
10casualTheater on the Lake41.92625-87.630935833333323819

– Casuals: Theater on the Lake, Michigan Ave. & Oak St., Streeter Dr. & Gradn Ave., Millennium Park and Lake Shore Dr. & Monroe St..

– Members: Wells St. & Concord Ln., Wells St. & Elm St., Park St. & Elm St., Dearborn St. & Erie St. and Kingsbury St. & Kinize St..

3- Geolocation map of the top 5 start and end stations

###Getting a shapefile of Chicago, and fortifying it into a dataframe

-Changing the decimal point to a comma for our plots:

start_all_geo$Start_Latitude = as.numeric(gsub(",",".",start_all_geo$Start_Latitude,fixed=TRUE))
start_all_geo$Start_Longitude = as.numeric(gsub(",",".",start_all_geo$Start_Longitude,fixed=TRUE))
end_all_geo$End_Latitude = as.numeric(gsub(",",".",end_all_geo$End_Latitude, fixed=TRUE))
end_all_geo$End_Longitude = as.numeric(gsub(",",".",end_all_geo$End_Longitude, fixed=TRUE))

-Getting a shapefile of Chicago, and fortifying it into a dataframe: Source: https://data.cityofchicago.org/

chicago_map <- readOGR(dsn="/home/david/Documents/Google_course/Course/Case 1/map_chicago", layer="geo_export_8079d6b6-c882-4cf2-8acb-20d99ac6850f")
chicago_df = fortify(chicago_map)

3.1- Start station geolocations:

start_map <-ggplot() +
    geom_polygon(data = chicago_df, aes(x = long, y=lat , group = group), colour = 'grey', 
    fill = 'chartreuse4', size = .2) +
    geom_point(data = start_all_geo,
             aes(x = Start_Longitude, y = Start_Latitude, size = Num_Trips, color = member_casual), 
             alpha = 1) +
    geom_label_repel(data = start_all_geo,
                   aes(x = Start_Longitude, y = Start_Latitude, label = Start_Station),
                   size = 3,
                   box.padding   = 0.4, 
                   point.padding = 0.65,
                   segment.color = 'yellow') +
  scale_colour_manual(values=c(member = 'brown2',  casual= 'blueviolet'))+
  facet_wrap(~member_casual) +
  labs(title = "Top 5 Starting Stations", size = 'Nº Trips',
       color = 'Rider Type') +
  coord_cartesian(xlim = c(-87.7, -87.55), ylim = c(41.85, 41.95))+
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(panel.background = element_rect(fill = "lightblue")) +
        theme(panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
start_map

3.2- End station geolocations:

end_map <- ggplot() +
    geom_polygon(data = chicago_df, aes(x = long, y=lat , group = group), colour = 'grey', 
    fill = 'chartreuse4', size = .2) +
  geom_point(data = end_all_geo,
             aes(x = End_Longitude, y = End_Latitude, size = Num_Trips, color = member_casual),
             alpha = 1) +
  geom_label_repel(data = end_all_geo,
                   aes(x = End_Longitude, y = End_Latitude, label = End_Station),
                   size = 3,
                   box.padding   = 0.4, 
                   point.padding = 0.65,
                   segment.color = 'yellow') +
  scale_colour_manual(values=c(member = 'brown2',  casual= 'blueviolet')) +
  facet_wrap(~member_casual) +
  labs(title = "Top 5 Ending Stations.", size = 'Nº Trips',
       color = 'Rider Type') +
  coord_cartesian(xlim = c(-87.7, -87.55), ylim = c(41.85, 41.95)) + 
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(panel.background = element_rect(fill = "lightblue")) +
    theme(panel.border = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank())
end_map

We can observe the Top 5 ending and starting stations are the sames.

4- Tracing the different results obtained

-First, we are going to replace the numerical values with names of weekdays:

day_week_total$day_of_week[day_week_total$day_of_week == "1"] <- "Sunday"
day_week_total$day_of_week[day_week_total$day_of_week == "2"] <- "Monday"
day_week_total$day_of_week[day_week_total$day_of_week == "3"] <- "Tuesday"
day_week_total$day_of_week[day_week_total$day_of_week == "4"] <- "Wednesday"
day_week_total$day_of_week[day_week_total$day_of_week == "5"] <- "Thursday"
day_week_total$day_of_week[day_week_total$day_of_week == "6"] <- "Friday"
day_week_total$day_of_week[day_week_total$day_of_week == "7"] <- "Saturday"

4.1- Stacked bar plot with the yearly modes for all riders

**-Let’s to create a function in the order I established so that x axis isn’t sorted. This function finds the sum of casual and member riders, to be used to plot labels in the middle of each bar.

day_week_total$day_of_week <- factor(day_week_total$day_of_week, levels = rev(unique(day_week_total$day_of_week)), ordered=TRUE)
day_week_total <- day_week_total %>%
  arrange(day_of_week, rev(member_casual)) %>%
  group_by(day_of_week) %>%
  mutate(GTotal = cumsum(Total) - 0.5 * Total)

View(day_week_total)
day_of_weekmember_casualTotalGTotal
1Sundaymember337639168819.5
2Sundaycasual420590547934
3Mondaymember364920182460
4Mondaycasual250565490202.5
5Tuesdaymember400580200290
6Tuesdaycasual242367521763.5
7Wednesdaymember407021203510.5
8Wednesdaycasual241960528001
9Thursdaymember390808195404
10Thursdaycasual247971514793.5
11Fridaymember396755198377.5
12Fridaycasual322484557997
13Saturdaymember390260195130
14Saturdaycasual499152639836
yearly_plot <- ggplot(data = day_week_total, aes(x = day_of_week, y = Total, fill = member_casual)) +
             scale_fill_manual(values=c(member = 'brown2',  casual= 'blueviolet')) +
  geom_col() +
  geom_text(aes(y = GTotal, label = Total), vjust = 1.5, colour = "white") +
  labs(title = "Total rides per day of week from 2020-09 to 2021-08", x = "Day of Week",
       y = "Total Rides", fill = "Rider Type") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
      theme(plot.title = element_text(hjust = 0.5)) 
yearly_plot

We can see that the weekend casual users increase.

4.2- A side by side bar plot with Total Rides vs. Rideable Type for all riders

-Build a query to return results related to rideble types by members:

bike_type <- sqldf("SELECT rideable_type, member_casual, COUNT(rideable_type) as number_of_uses
                 FROM df
                 GROUP BY member_casual, rideable_type
                 ORDER BY count(rideable_type) DESC", 
method = "auto" )

View(bike_type)

-Removing the underscore in rideable types:

bike_type$rideable_type[bike_type$rideable_type == "classic_bike"] <- "Classic Bike"
bike_type$rideable_type[bike_type$rideable_type == "docked_bike"] <- "Docked Bike"
bike_type$rideable_type[bike_type$rideable_type == "electric_bike"] <- "Electric Bike"

View(bike_type)
rideable_typemember_casualnumber_of_uses
1Classic Bikemember1363298
2Classic Bikecasual925249
3Electric Bikemember821653
4Electric Bikecasual755623
5Docked Bikecasual544217
6Docked Bikemember503032
bike_type_plot <- ggplot(data = bike_type, aes(x = rideable_type, y = number_of_uses, fill = member_casual)) +
  scale_fill_manual(values=c(member = 'brown2',  casual= 'blueviolet')) +
  geom_col(position = "dodge") +
  geom_text(aes(label = number_of_uses),  vjust = -0.3 ,colour = "black", 
            position = position_dodge(.9)) +
  labs(title = "Total Rides Per Rideable Type from 2020-09 to 2021-08", x = "Rideable Type",
       y = "Total Rides", fill = "Rider Type") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
      theme(plot.title = element_text(hjust = 0.5)) 
bike_type_plot

Classic bikes are preferred by users, especially by members.

Tableau Pivot Tables

One the hand, casual users are the ones who travel the most distance in any month of the year studied. Highlighting the month of September with 36 minutes on average and January with 22. On the other hand, the members are more stable.

We can also see from September to November how the distance traveled is usually greater. New year and new purpose of cycling to work or school?

We doble check that September is the month with the highest average time traveled, this time decreases as winter progresses and gradually picks up again until summer.

We can see how the summer months are the months with the highest number of trips, especially Saturdays. This, together with what was observed in the previous graphs, can make us indicate that people in summer travel shorter journeys than in autumn.

We verify that the bulk of the trips are made in the months of June, July and August.