This is the capstone project marking the end of the Google Data Analytics certificate course. We will use R Markdown for data cleaning, analysis and visualisation, then share our findings and propose an actionable plan. The following steps of the data analysis process will be followed: ask, prepare, process, analyse, share and act.
Lily Moreno, the director of marketing at Cyclistic, a bike-share company in Chicago, believes the company’s future success depends on maximising the number of annual memberships. Her team wants to understand how casual riders and annual members use Cyclistic bikes differently and, from these insights, design a new marketing strategy to convert casual riders into members. As the junior analyst on the team, our job is to find the answers and back them up with compelling data insights and professional visualisations.
The three main questions guiding the future marketing program are:
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?
We have been assigned the first question: how do annual members and casual riders use Cyclistic bikes differently?
To answer this question, we will load the datasets for the year 2024 and wrangle the data to find insights that address the question above.
In order to prepare for the task at hand, we will access the data from this website. The datasets have a different name because Cyclistic is a fictional company; for the purpose of this case study, the datasets are appropriate and will enable us to answer the business questions. The data has been made available by Motivate International Inc. under this license, which verifies the integrity of the data being used.
As mentioned above, we will use R Markdown as the tool of choice for this study. R Markdown combines the functionality of an R script with a formatted report: code is written in chunks, which lets us integrate our findings directly into the document.
First, we load the necessary packages. These will allow us to perform the analyses that follow.
library(tidyverse) #helps wrangle data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#using the conflicted package to manage conflicts
library(conflicted)
library(lubridate) # needed to extract the hour and other date parts (already attached by tidyverse 2.0.0)
#setting dplyr::filter and dplyr::lag as the default choices
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
The next step is to import the data. The 12 monthly files were downloaded, unzipped and renamed for ease of access, then imported into R Markdown with the following code.
jan_24<- read_csv("jan.csv")
## Rows: 144873 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb_24<- read_csv("feb.csv")
## Rows: 223164 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar_24<- read_csv("mar.csv")
## Rows: 301687 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr_24<- read_csv("apr.csv")
## Rows: 415025 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_24<- read_csv("may.csv")
## Rows: 609493 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun_24<- read_csv("jun.csv")
## Rows: 710721 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul_24<- read_csv("jul.csv")
## Rows: 748962 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug_24<- read_csv("aug.csv")
## Rows: 755639 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep_24<- read_csv("sep.csv")
## Rows: 821276 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct_24<- read_csv("oct.csv")
## Rows: 616281 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov_24<- read_csv("nov.csv")
## Rows: 335075 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec_24<- read_csv("dec.csv")
## Rows: 178372 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
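As a side note, the twelve read_csv() calls above could be collapsed into a single step. The sketch below is an optional alternative, assuming the working directory contains only the 12 monthly CSV files; it is not what this report actually ran.

trip_files <- list.files(pattern = "\\.csv$", full.names = TRUE) #list the 12 monthly files
all_months <- purrr::map(trip_files, read_csv, show_col_types = FALSE) %>% bind_rows() #read each file and stack them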
The first step is to make sure the data is clean. Using colnames() lets us check whether all 12 datasets share the same column names. If they do, we can move forward; if not, we will have to rename some columns to match the rest of the data.
colnames(jan_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(feb_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(mar_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(apr_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(may_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jun_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(jul_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(aug_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(sep_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(oct_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(nov_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
colnames(dec_24)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
The column names are the same throughout, but working with 12 separate datasets would be tedious. In the next step, we combine them into one big dataset called “all_trips”.
all_trips <- bind_rows(jan_24, feb_24, mar_24, apr_24, may_24, jun_24, jul_24, aug_24, sep_24, oct_24, nov_24, dec_24)
The next step is to make sure “ride_id” and “rideable_type” are stored as character strings, so they behave consistently throughout the analysis.
all_trips<- mutate(all_trips, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type))
Now we check the combined data to make sure it is consistent. We will use the following code to do so.
colnames(all_trips)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
nrow(all_trips)
## [1] 5860568
dim(all_trips)
## [1] 5860568 13
head(all_trips)
## # A tibble: 6 × 13
## ride_id rideable_type started_at ended_at
## <chr> <chr> <dttm> <dttm>
## 1 C1D650626C8C899A electric_bike 2024-01-12 15:30:27 2024-01-12 15:37:59
## 2 EECD38BDB25BFCB0 electric_bike 2024-01-08 15:45:46 2024-01-08 15:52:59
## 3 F4A9CE78061F17F7 electric_bike 2024-01-27 12:27:19 2024-01-27 12:35:19
## 4 0A0D9E15EE50B171 classic_bike 2024-01-29 16:26:17 2024-01-29 16:56:06
## 5 33FFC9805E3EFF9A classic_bike 2024-01-31 05:43:23 2024-01-31 06:09:35
## 6 C96080812CD285C5 classic_bike 2024-01-07 11:21:24 2024-01-07 11:30:03
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## # end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## # start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
str(all_trips)
## tibble [5,860,568 × 13] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5860568] "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
## $ rideable_type : chr [1:5860568] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5860568], format: "2024-01-12 15:30:27" "2024-01-08 15:45:46" ...
## $ ended_at : POSIXct[1:5860568], format: "2024-01-12 15:37:59" "2024-01-08 15:52:59" ...
## $ start_station_name: chr [1:5860568] "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
## $ start_station_id : chr [1:5860568] "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
## $ end_station_name : chr [1:5860568] "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
## $ end_station_id : chr [1:5860568] "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
## $ start_lat : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:5860568] "member" "member" "member" "member" ...
summary(all_trips)
## ride_id rideable_type started_at
## Length:5860568 Length:5860568 Min. :2024-01-01 00:00:39.00
## Class :character Class :character 1st Qu.:2024-05-20 19:47:53.00
## Mode :character Mode :character Median :2024-07-22 20:36:16.27
## Mean :2024-07-17 07:55:47.61
## 3rd Qu.:2024-09-17 20:14:22.56
## Max. :2024-12-31 23:56:49.84
##
## ended_at start_station_name start_station_id
## Min. :2024-01-01 00:04:20.00 Length:5860568 Length:5860568
## 1st Qu.:2024-05-20 20:07:54.75 Class :character Class :character
## Median :2024-07-22 20:53:59.16 Mode :character Mode :character
## Mean :2024-07-17 08:13:06.54
## 3rd Qu.:2024-09-17 20:27:46.02
## Max. :2024-12-31 23:59:55.70
##
## end_station_name end_station_id start_lat start_lng
## Length:5860568 Length:5860568 Min. :41.64 Min. :-87.91
## Class :character Class :character 1st Qu.:41.88 1st Qu.:-87.66
## Mode :character Mode :character Median :41.90 Median :-87.64
## Mean :41.90 Mean :-87.65
## 3rd Qu.:41.93 3rd Qu.:-87.63
## Max. :42.07 Max. :-87.52
##
## end_lat end_lng member_casual
## Min. :16.06 Min. :-144.05 Length:5860568
## 1st Qu.:41.88 1st Qu.: -87.66 Class :character
## Median :41.90 Median : -87.64 Mode :character
## Mean :41.90 Mean : -87.65
## 3rd Qu.:41.93 3rd Qu.: -87.63
## Max. :87.96 Max. : 152.53
## NA's :7232 NA's :7232
This chunk gives us a brief summary of the data types and value ranges.
As we can see in the tibbles above, the “started_at” and “ended_at” columns combine the date and time in a single timestamp. Let’s break this down into separate components.
all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
This chunk adds new columns for the date, month, day, year and day of the week.
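For reference, the same information can be pulled out with lubridate helpers (attached as part of the tidyverse). This is just an equivalent sketch, not what the report ran; note that it returns numbers and an ordered factor rather than the character strings produced by format().

all_trips %>%
  mutate(month = month(started_at), #month as a number 1-12
         day = mday(started_at), #day of the month
         year = year(started_at),
         day_of_week = wday(started_at, label = TRUE, abbr = FALSE)) #ordered factor of weekday names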
We now move on to finding the duration of each ride (in seconds). The next two code chunks compute it and show what the result looks like.
all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at) #ride duration, stored here in seconds
str(all_trips)
## tibble [5,860,568 × 19] (S3: tbl_df/tbl/data.frame)
## $ ride_id : chr [1:5860568] "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
## $ rideable_type : chr [1:5860568] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
## $ started_at : POSIXct[1:5860568], format: "2024-01-12 15:30:27" "2024-01-08 15:45:46" ...
## $ ended_at : POSIXct[1:5860568], format: "2024-01-12 15:37:59" "2024-01-08 15:52:59" ...
## $ start_station_name: chr [1:5860568] "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
## $ start_station_id : chr [1:5860568] "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
## $ end_station_name : chr [1:5860568] "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
## $ end_station_id : chr [1:5860568] "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
## $ start_lat : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
## $ start_lng : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.7 ...
## $ end_lat : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
## $ end_lng : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.6 ...
## $ member_casual : chr [1:5860568] "member" "member" "member" "member" ...
## $ date : Date[1:5860568], format: "2024-01-12" "2024-01-08" ...
## $ month : chr [1:5860568] "01" "01" "01" "01" ...
## $ day : chr [1:5860568] "12" "08" "27" "29" ...
## $ year : chr [1:5860568] "2024" "2024" "2024" "2024" ...
## $ day_of_week : chr [1:5860568] "Friday" "Monday" "Saturday" "Monday" ...
## $ ride_length : 'difftime' num [1:5860568] 452 433 480 1789 ...
## ..- attr(*, "units")= chr "secs"
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE
The code above converts the new ride_length column from a difftime object into a plain number so that we can run calculations on it. The is.numeric() check returning “TRUE” confirms the conversion succeeded.
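The duration calculation and the numeric conversion could also be done in a single step; a sketch of that alternative, assuming seconds are the unit we want:

all_trips <- all_trips %>%
  mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = "secs"))) #duration in seconds, already numeric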
If we check the tibble carefully, some ride lengths are zero or negative. The likely reason is that bikes were taken out of circulation by engineers and mechanics for checks and maintenance, or that the records are errors. For the purpose of this study we will remove them, since we only want genuine trips taken by consumers.
all_trips_v2 <- all_trips[!(all_trips$ride_length <= 0) & !is.na(all_trips$ride_length), ]
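As a quick sanity check (not part of the original cleaning chunk), we can count how many rows the filter dropped:

nrow(all_trips) - nrow(all_trips_v2) #rows removed by the filter
sum(all_trips$ride_length <= 0 | is.na(all_trips$ride_length)) #should match: zero, negative or missing durations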
Now that the data is clean, we can begin the analysis.
mean(all_trips_v2$ride_length)
## [1] 1039.147
median(all_trips_v2$ride_length)
## [1] 583
max(all_trips_v2$ride_length)
## [1] 93596
min(all_trips_v2$ride_length)
## [1] 0.0389998
Using the code chunk above, we find the mean, median, maximum and minimum ride length (in seconds) across all rides.
But since these numbers cover all users combined, they do not tell us much on their own. In the next step, we will compute the same statistics grouped by rider type: casual and member.
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 1509.2684
## 2 member 766.4129
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 723.330
## 2 member 521.707
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 93596
## 2 member 93588
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
## all_trips_v2$member_casual all_trips_v2$ride_length
## 1 casual 0.05099988
## 2 member 0.03899980
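The four aggregate() calls above can also be written as a single dplyr summary; a sketch that produces the same statistics in one table:

all_trips_v2 %>%
  group_by(member_casual) %>%
  summarise(mean_ride = mean(ride_length),
            median_ride = median(ride_length),
            max_ride = max(ride_length),
            min_ride = min(ride_length)) #same figures as the aggregate() output above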
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Friday 1471.7761
## 2 member Friday 745.5064
## 3 casual Monday 1448.4387
## 4 member Monday 732.7006
## 5 casual Saturday 1691.7650
## 6 member Saturday 844.6969
## 7 casual Sunday 1765.4818
## 8 member Sunday 856.4958
## 9 casual Thursday 1314.6240
## 10 member Thursday 735.5206
## 11 casual Tuesday 1290.9601
## 12 member Tuesday 737.0622
## 13 casual Wednesday 1336.7755
## 14 member Wednesday 747.1673
Following this code, we can see the data grouped by rider type and day of the week, but the days appear in alphabetical order. To make it more presentable, we need to arrange the days of the week in calendar order.
The following chunk will allow us to do so.
all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
Re-running the same aggregation now gives the following results.
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
## all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1 casual Sunday 1765.4818
## 2 member Sunday 856.4958
## 3 casual Monday 1448.4387
## 4 member Monday 732.7006
## 5 casual Tuesday 1290.9601
## 6 member Tuesday 737.0622
## 7 casual Wednesday 1336.7755
## 8 member Wednesday 747.1673
## 9 casual Thursday 1314.6240
## 10 member Thursday 735.5206
## 11 casual Friday 1471.7761
## 12 member Friday 745.5064
## 13 casual Saturday 1691.7650
## 14 member Saturday 844.6969
all_trips_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n()
,average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups: member_casual [2]
## member_casual weekday number_of_rides average_duration
## <chr> <ord> <int> <dbl>
## 1 casual Sun 369889 1765.
## 2 casual Mon 253564 1448.
## 3 casual Tue 232370 1291.
## 4 casual Wed 269318 1337.
## 5 casual Thu 265072 1315.
## 6 casual Fri 315840 1472.
## 7 casual Sat 445344 1692.
## 8 member Sun 417200 856.
## 9 member Mon 534542 733.
## 10 member Tue 570619 737.
## 11 member Wed 610187 747.
## 12 member Thu 570519 736.
## 13 member Fri 525766 746.
## 14 member Sat 479615 845.
The code below finds the split between rider types. It shows that 63.3% of rides were taken by annual members, while 36.7% were taken by casual riders.
all_trips_v2 %>%
group_by(member_casual) %>%
summarise(ride_count = length(ride_id), ride_percentage = (length(ride_id) / nrow(all_trips_v2)) * 100)
## # A tibble: 2 × 3
## member_casual ride_count ride_percentage
## <chr> <int> <dbl>
## 1 casual 2151397 36.7
## 2 member 3708448 63.3
ggplot(all_trips_v2, aes(x = member_casual, fill=member_casual)) +
geom_bar() +
labs(x="Casuals vs Members", y="Number Of Rides", title= "Casuals vs Members distribution")
The final step of this case study is an actionable plan. Based on the analysis above, the following suggestions can be shared at the next stakeholder meeting, with the aim of increasing the annual member count.
Casual riders seem to use Cyclistic bikes far less during the winter season. Introducing a short-term subscription, or a cheaper plan targeted at specific months, could encourage casual riders towards membership.
Annual members appear to use Cyclistic bikes during rush hours, potentially for their daily commute. A subscription plan that encourages casual riders to do the same could be useful (see the sketch after these recommendations for how hourly usage could be examined).
Many casual riders stick to one-time payments because they believe a subscription would cost them more. Repricing so that one-time payments are relatively more expensive and subscriptions noticeably cheaper could sway many of these riders towards membership.
Offer casual riders a free one-month trial of the subscription, so they can experience the benefits they would enjoy as annual members.
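To support the rush-hour observation in a future iteration, the start hour could be extracted with lubridate (loaded earlier for exactly this purpose) and summarised per rider type. A sketch, not run in this report:

all_trips_v2 %>%
  mutate(start_hour = hour(started_at)) %>% #hour of day, 0-23
  group_by(member_casual, start_hour) %>%
  summarise(number_of_rides = n(), .groups = "drop") %>%
  ggplot(aes(x = start_hour, y = number_of_rides, colour = member_casual)) +
  geom_line() +
  labs(x="Hour of Day", y="Number Of Rides", title= "Hourly usage: Casuals vs Members")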