Introduction

This is a capstone project to mark the end of the Google Analytics Data course certification. We will be using the R Markdown for data cleaning, analysis and visualization.We will then share our findings and come up with an actionable plan. The following steps of data analysis process, will be followed:

Scenario

Lily Moreno, The director of marketing of Cyclistic, a bike-share company in Chicago, believes the company’s future success depends on maximising the number of annual memberships. The team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into members. As a junior analyst, it is our job to find the answers and back them up with compelling data insights and professional visualisations.

1. Ask

The three main questions being asked to guide the future marketing program are:

  1. How do annual members and casual riders use Cyclistic bikes differently?
  2. Why would casual riders buy Cyclistic annual memberships?
  3. How can Cyclistic use digital media to influence riders to become members?

Statement of business task

We have been assigned the first question, How do annual members and casual riders use Cyclistic bikes differently?

To answer this question we will be loading the data sets from the year 2024. From the data sets, we will wrangle the data to find insights that will help us answer the question asked above.

2. Prepare

In order to prepare for the task at hand, we will access the data from this website. The datasets have a different name because Cyclistic is a fictional company. For the purpose of this case study, the datasets are appropriate and will enable us to answer the business questions. The data has been made available Motivate International Inc. under this license. This verifies the integrity of the data being used.

As mentioned above, for the purpose of this study, we will be using R markdown as the tool of choice. R markdown allows users to progress by combining the functions of R script and an html file. the codes can be written in chunks, which will allow users to integrate their findings into the report.

First we will be loading the necessary packages into the script. These packages will allow us to perform the necessary analyses.

library(tidyverse) #helps wrangle data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#using the conflicted package to manage conflicts
library(conflicted)
#setting dplyr::filter and dplyr::lag as the default choices
library(lubridate)  # Needed to extract hour
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.

The next step is to import the data. After 12 months of data were downloaded, they were unzipped and renamed for the ease of access. Then they were imported into R markdown using the following codes.

jan_24<- read_csv("jan.csv")
## Rows: 144873 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
feb_24<- read_csv("feb.csv")
## Rows: 223164 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mar_24<- read_csv("mar.csv")
## Rows: 301687 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
apr_24<- read_csv("apr.csv")
## Rows: 415025 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
may_24<- read_csv("may.csv")
## Rows: 609493 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jun_24<- read_csv("jun.csv")
## Rows: 710721 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
jul_24<- read_csv("jul.csv")
## Rows: 748962 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
aug_24<- read_csv("aug.csv")
## Rows: 755639 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sep_24<- read_csv("sep.csv")
## Rows: 821276 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
oct_24<- read_csv("oct.csv")
## Rows: 616281 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov_24<- read_csv("nov.csv")
## Rows: 335075 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dec_24<- read_csv("dec.csv")
## Rows: 178372 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3. Process

The first step is to make sure the data is clean. using “colnames” helps us see if all 12 datasets have the same column names. If they do we can move forward, if not we will have to rename some of them to match the remaining data.

colnames(jan_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(feb_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(mar_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(apr_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(may_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(jun_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(jul_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(aug_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(sep_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(oct_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(nov_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
colnames(dec_24)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"

The column names are all same throughout. But dealing with 12 different datasets would make it tedious. In the next step, we will combine them into one big dataset and call it “all_trips”.

all_trips <- bind_rows(jan_24,feb_24,mar_24,apr_24,may_24,jun_24,jul_24,aug_24,sep_24,oct_24,nov_24,dec_24)

The next step is to change “ride_id” and “rideable_type” into character strings to make it easier as we go through the analysis.

all_trips<- mutate(all_trips, ride_id = as.character(ride_id), rideable_type = as.character(rideable_type))

Now we check all the data to make sure its consistent. we will use the following codes to do so.

colnames(all_trips)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
nrow(all_trips)
## [1] 5860568
dim(all_trips)
## [1] 5860568      13
head(all_trips)
## # A tibble: 6 × 13
##   ride_id          rideable_type started_at          ended_at           
##   <chr>            <chr>         <dttm>              <dttm>             
## 1 C1D650626C8C899A electric_bike 2024-01-12 15:30:27 2024-01-12 15:37:59
## 2 EECD38BDB25BFCB0 electric_bike 2024-01-08 15:45:46 2024-01-08 15:52:59
## 3 F4A9CE78061F17F7 electric_bike 2024-01-27 12:27:19 2024-01-27 12:35:19
## 4 0A0D9E15EE50B171 classic_bike  2024-01-29 16:26:17 2024-01-29 16:56:06
## 5 33FFC9805E3EFF9A classic_bike  2024-01-31 05:43:23 2024-01-31 06:09:35
## 6 C96080812CD285C5 classic_bike  2024-01-07 11:21:24 2024-01-07 11:30:03
## # ℹ 9 more variables: start_station_name <chr>, start_station_id <chr>,
## #   end_station_name <chr>, end_station_id <chr>, start_lat <dbl>,
## #   start_lng <dbl>, end_lat <dbl>, end_lng <dbl>, member_casual <chr>
str(all_trips)
## tibble [5,860,568 × 13] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5860568] "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
##  $ rideable_type     : chr [1:5860568] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5860568], format: "2024-01-12 15:30:27" "2024-01-08 15:45:46" ...
##  $ ended_at          : POSIXct[1:5860568], format: "2024-01-12 15:37:59" "2024-01-08 15:52:59" ...
##  $ start_station_name: chr [1:5860568] "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
##  $ start_station_id  : chr [1:5860568] "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
##  $ end_station_name  : chr [1:5860568] "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
##  $ end_station_id    : chr [1:5860568] "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
##  $ start_lat         : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5860568] "member" "member" "member" "member" ...
summary(all_trips)
##    ride_id          rideable_type        started_at                    
##  Length:5860568     Length:5860568     Min.   :2024-01-01 00:00:39.00  
##  Class :character   Class :character   1st Qu.:2024-05-20 19:47:53.00  
##  Mode  :character   Mode  :character   Median :2024-07-22 20:36:16.27  
##                                        Mean   :2024-07-17 07:55:47.61  
##                                        3rd Qu.:2024-09-17 20:14:22.56  
##                                        Max.   :2024-12-31 23:56:49.84  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2024-01-01 00:04:20.00   Length:5860568     Length:5860568    
##  1st Qu.:2024-05-20 20:07:54.75   Class :character   Class :character  
##  Median :2024-07-22 20:53:59.16   Mode  :character   Mode  :character  
##  Mean   :2024-07-17 08:13:06.54                                        
##  3rd Qu.:2024-09-17 20:27:46.02                                        
##  Max.   :2024-12-31 23:59:55.70                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5860568     Length:5860568     Min.   :41.64   Min.   :-87.91  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :42.07   Max.   :-87.52  
##                                                                        
##     end_lat         end_lng        member_casual     
##  Min.   :16.06   Min.   :-144.05   Length:5860568    
##  1st Qu.:41.88   1st Qu.: -87.66   Class :character  
##  Median :41.90   Median : -87.64   Mode  :character  
##  Mean   :41.90   Mean   : -87.65                     
##  3rd Qu.:41.93   3rd Qu.: -87.63                     
##  Max.   :87.96   Max.   : 152.53                     
##  NA's   :7232    NA's   :7232

The chunk has given us a brief summary of all the data types.

As we can see in the tibbles above, the “started_at” and “ended_at” columns have merged the date and time. Let’s break this down.

all_trips$date <- as.Date(all_trips$started_at) #The default format is yyyy-mm-dd
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$year <- format(as.Date(all_trips$date), "%Y")
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")

This chunk will separate the columns into year, month, day and day of the week.

We now move on towards finding the duration of ride (in seconds) from every rider. The next two code chunks will find the required answer and show us what it looks like.

all_trips$ride_length <- difftime(all_trips$ended_at,all_trips$started_at)
str(all_trips)
## tibble [5,860,568 × 19] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5860568] "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
##  $ rideable_type     : chr [1:5860568] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5860568], format: "2024-01-12 15:30:27" "2024-01-08 15:45:46" ...
##  $ ended_at          : POSIXct[1:5860568], format: "2024-01-12 15:37:59" "2024-01-08 15:52:59" ...
##  $ start_station_name: chr [1:5860568] "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
##  $ start_station_id  : chr [1:5860568] "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
##  $ end_station_name  : chr [1:5860568] "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
##  $ end_station_id    : chr [1:5860568] "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
##  $ start_lat         : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:5860568] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5860568] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5860568] "member" "member" "member" "member" ...
##  $ date              : Date[1:5860568], format: "2024-01-12" "2024-01-08" ...
##  $ month             : chr [1:5860568] "01" "01" "01" "01" ...
##  $ day               : chr [1:5860568] "12" "08" "27" "29" ...
##  $ year              : chr [1:5860568] "2024" "2024" "2024" "2024" ...
##  $ day_of_week       : chr [1:5860568] "Friday" "Monday" "Saturday" "Monday" ...
##  $ ride_length       : 'difftime' num [1:5860568] 452 433 480 1789 ...
##   ..- attr(*, "units")= chr "secs"
all_trips$ride_length <- as.numeric(as.character(all_trips$ride_length))
is.numeric(all_trips$ride_length)
## [1] TRUE

The code above converts the new column we created into a number format if in case it wasn’t. But since our column is already in the required format, the chunk confirms it by saying “TRUE”.

If we check the tibble carefully, some of those numbers are negative or zero. The likely reason for this to happen is if engineers and mechanics are taking it out for checks and maintenance. Therefore, for the purpose of this study, we will remove them as we need true data we receive from consumers.

all_trips_v2 <- all_trips[!(all_trips$ride_length <= 0) & !is.na(all_trips$ride_length), ]

Now that the data is clean, we will begin analysing the data.

4. Analyse

mean(all_trips_v2$ride_length)
## [1] 1039.147
median(all_trips_v2$ride_length)
## [1] 583
max(all_trips_v2$ride_length)
## [1] 93596
min(all_trips_v2$ride_length)
## [1] 0.0389998

Using the chunk code above, we are finding:

But since these numbers are for all users, it doesn’t give us much information. So, in our next step, we will find these numbers by grouping user types into Casual and Members.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = mean)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                1509.2684
## 2                     member                 766.4129
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = median)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                  723.330
## 2                     member                  521.707
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = max)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual                    93596
## 2                     member                    93588
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual, FUN = min)
##   all_trips_v2$member_casual all_trips_v2$ride_length
## 1                     casual               0.05099988
## 2                     member               0.03899980
aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Friday                1471.7761
## 2                      member                   Friday                 745.5064
## 3                      casual                   Monday                1448.4387
## 4                      member                   Monday                 732.7006
## 5                      casual                 Saturday                1691.7650
## 6                      member                 Saturday                 844.6969
## 7                      casual                   Sunday                1765.4818
## 8                      member                   Sunday                 856.4958
## 9                      casual                 Thursday                1314.6240
## 10                     member                 Thursday                 735.5206
## 11                     casual                  Tuesday                1290.9601
## 12                     member                  Tuesday                 737.0622
## 13                     casual                Wednesday                1336.7755
## 14                     member                Wednesday                 747.1673

Following this code, we can see that the data is grouped by member types followed by day of the week. In order to make it more presentable, we need to arrange it by the day of week.

The following chunk will allow us to do so.

all_trips_v2$day_of_week <- ordered(all_trips_v2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

When rewriting the code, it will give us the following results.

aggregate(all_trips_v2$ride_length ~ all_trips_v2$member_casual + all_trips_v2$day_of_week, FUN = mean)
##    all_trips_v2$member_casual all_trips_v2$day_of_week all_trips_v2$ride_length
## 1                      casual                   Sunday                1765.4818
## 2                      member                   Sunday                 856.4958
## 3                      casual                   Monday                1448.4387
## 4                      member                   Monday                 732.7006
## 5                      casual                  Tuesday                1290.9601
## 6                      member                  Tuesday                 737.0622
## 7                      casual                Wednesday                1336.7755
## 8                      member                Wednesday                 747.1673
## 9                      casual                 Thursday                1314.6240
## 10                     member                 Thursday                 735.5206
## 11                     casual                   Friday                1471.7761
## 12                     member                   Friday                 745.5064
## 13                     casual                 Saturday                1691.7650
## 14                     member                 Saturday                 844.6969
all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%  
  group_by(member_casual, weekday) %>%
  summarise(number_of_rides = n()           
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)       
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
## # A tibble: 14 × 4
## # Groups:   member_casual [2]
##    member_casual weekday number_of_rides average_duration
##    <chr>         <ord>             <int>            <dbl>
##  1 casual        Sun              369889            1765.
##  2 casual        Mon              253564            1448.
##  3 casual        Tue              232370            1291.
##  4 casual        Wed              269318            1337.
##  5 casual        Thu              265072            1315.
##  6 casual        Fri              315840            1472.
##  7 casual        Sat              445344            1692.
##  8 member        Sun              417200             856.
##  9 member        Mon              534542             733.
## 10 member        Tue              570619             737.
## 11 member        Wed              610187             747.
## 12 member        Thu              570519             736.
## 13 member        Fri              525766             746.
## 14 member        Sat              479615             845.

The codes below to find out in what ratio the rider type is divided to. It shows us that 63.3% of the riders are annual subscription holders, while 36.7% of the riders are casual riders.

all_trips_v2 %>% 
    group_by(member_casual) %>% 
    summarise(ride_count = length(ride_id), ride_percentage = (length(ride_id) / nrow(all_trips_v2)) * 100)
## # A tibble: 2 × 3
##   member_casual ride_count ride_percentage
##   <chr>              <int>           <dbl>
## 1 casual           2151397            36.7
## 2 member           3708448            63.3
ggplot(all_trips_v2, aes(x = member_casual, fill=member_casual)) +
    geom_bar() +
    labs(x="Casuals vs Members", y="Number Of Rides", title= "Casuals vs Members distribution")

5. Share

From the above analyses we can share our findings. With the help of graphs from the “ggplot2” package, we will be able to generate relevant visualisations.

I. Ride Count by Week

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
   summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

From the chart shown above, we can see that the annual riders count is much higher compared to casual riders. We can also see that casual riders seem to use the bikes on weekends much more than the weekdays. Therefore we can assume that casual riders are using Cyclistic for recreational purposes.

all_trips_v2 %>%
  group_by(member_casual) %>% 
  summarise(average_ride_length = mean(ride_length), median_length = median(ride_length), 
            max_ride_length = max(ride_length), min_ride_length = min(ride_length))
## # A tibble: 2 × 5
##   member_casual average_ride_length median_length max_ride_length
##   <chr>                       <dbl>         <dbl>           <dbl>
## 1 casual                      1509.          723.           93596
## 2 member                       766.          522.           93588
## # ℹ 1 more variable: min_ride_length <dbl>

II. Average Ride by Week

all_trips_v2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The chart above shows us that the average rides by annual members is relatively low despite being constant throughout most of the week. However, casual riders have a higher average, with the numbers spiking up during the weekends.

III. Ride Count by Month

all_trips_v2 %>%  
  group_by(member_casual, month) %>% 
  summarise(number_of_rides = n(),.groups="drop") %>% 
  arrange(member_casual, month)  %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
  labs(title ="Total rides by Members and Casual riders Vs. Month", x = "Month", y= "Number Of Rides") +
  theme(axis.text.x = element_text(angle = 45)) +
  geom_col(position = "dodge")

The chart above shows us that ride counts are much higher for both rider types during the warmer months. Whereas, the colder months, such as November, December, January and February the ride counts drop significantly, especially for casual riders.

all_trips_v2 %>%  
  group_by(member_casual, month) %>% 
  summarise(average_ride_length = mean(ride_length),.groups="drop") %>%
  ggplot(aes(x = month, y = average_ride_length, fill = member_casual)) +
  geom_col(position = "dodge") + 
  labs(title ="Average ride length by Members and Casual riders Vs. Month") +
  theme(axis.text.x = element_text(angle = 30))

The chart above shows us the average ride count of casual members is much higher compared to annual members, despite having lower ride counts.

IV. Rider Type

all_trips_v2 %>%
    group_by(rideable_type) %>% 
    summarise(count = length(ride_id))
## # A tibble: 3 × 2
##   rideable_type      count
##   <chr>              <int>
## 1 classic_bike     2735558
## 2 electric_bike    2979950
## 3 electric_scooter  144337
ggplot(all_trips_v2, aes(x=rideable_type, fill=member_casual)) +
    labs(x="Rideable type", title="Rideable type Vs. Total rides by Members and Casual riders") +
geom_bar()

In the above chart, we can see that annual riders tend to use electric bikes more others. This chart also shows that electric scooters are used significantly less.

V. Rides by Hour

The next step for the analysis is to find out what time of day do most riders use Cyclistic.

First we will ensure the hour column exists and is accessible for the purpose of the analysis.

# Ensure hour column exists
all_trips_v2 <- all_trips_v2 %>%
  mutate(hour = hour(started_at))  # Extract hour from datetime

# Check if hour column was successfully created
str(all_trips_v2)  # Ensure "hour" appears in the output
## tibble [5,859,845 × 20] (S3: tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5859845] "C1D650626C8C899A" "EECD38BDB25BFCB0" "F4A9CE78061F17F7" "0A0D9E15EE50B171" ...
##  $ rideable_type     : chr [1:5859845] "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5859845], format: "2024-01-12 15:30:27" "2024-01-08 15:45:46" ...
##  $ ended_at          : POSIXct[1:5859845], format: "2024-01-12 15:37:59" "2024-01-08 15:52:59" ...
##  $ start_station_name: chr [1:5859845] "Wells St & Elm St" "Wells St & Elm St" "Wells St & Elm St" "Wells St & Randolph St" ...
##  $ start_station_id  : chr [1:5859845] "KA1504000135" "KA1504000135" "KA1504000135" "TA1305000030" ...
##  $ end_station_name  : chr [1:5859845] "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Kingsbury St & Kinzie St" "Larrabee St & Webster Ave" ...
##  $ end_station_id    : chr [1:5859845] "KA1503000043" "KA1503000043" "KA1503000043" "13193" ...
##  $ start_lat         : num [1:5859845] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5859845] -87.6 -87.6 -87.6 -87.6 -87.7 ...
##  $ end_lat           : num [1:5859845] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5859845] -87.6 -87.6 -87.6 -87.6 -87.6 ...
##  $ member_casual     : chr [1:5859845] "member" "member" "member" "member" ...
##  $ date              : Date[1:5859845], format: "2024-01-12" "2024-01-08" ...
##  $ month             : chr [1:5859845] "01" "01" "01" "01" ...
##  $ day               : chr [1:5859845] "12" "08" "27" "29" ...
##  $ year              : chr [1:5859845] "2024" "2024" "2024" "2024" ...
##  $ day_of_week       : Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 6 2 7 2 4 1 6 5 2 4 ...
##  $ ride_length       : num [1:5859845] 452 433 480 1789 1572 ...
##  $ hour              : int [1:5859845] 15 15 12 16 5 11 14 18 14 19 ...

After we get the positive result, we will use the following chunk.

all_trips_v2%>%
  group_by(member_casual, hour) %>%
  summarise(average_ride_length = mean(ride_length),.groups = "drop") %>%
  ggplot(aes(x = hour, y = average_ride_length, fill = member_casual)) + geom_col(position = "dodge") + labs(title = "Average ride length by members and casual riders Vs. hour") 

As with previous visualisations, the annual members have a relatively constant ride length throughout the day, while the casual riders average ride count drops in the mornings and again drops slightly during the evenings.

all_trips_v2 %>%
  ggplot(aes(x = hour, fill = member_casual)) + 
  geom_bar(position = "dodge") +  # Automatically counts rides
  labs(
    x = "Hour of the Day",
    y = "Number of Rides",
    title = "Cyclistic Users' Demand by Hour"
  ) +
  theme_minimal()

The final visualisation in this process shows us that members have much higher ride counts throughout most of the time of day. Between midnight and 7-8 am is when the ride counts are significantly low.

For annual members, the ride count seems to spike between the morning rush hour and the evening rush as well. This tends to suggest many annual riders use Cyclistic to commute to and from work. The same trend can be seen for casual riders but in a lower volume.

5. Act

The final step of this case study involves an actionable plan. With the help of the analysis above, The following suggestions can be shared during the next stakeholders meeting, with hopes to increase annual members count.

Deliverables

  1. Create subscription plans to encourage rides during winter:

Most casual riders seem to use the Cyclistic bikes less during winter season. Introducing a short term subscription, or a cheaper model targeted for the specific months, could encourage the casual members towards subscription.

  1. Incentivise using Cyclistic bikes for daily commute

As we could see that most annual riders use Cyclistic bikes during rush hours, potentially for their daily commute. A subscription plan that encourages casual riders to do the same could be useful.

  1. Change the cost per individual rides

Many casual riders tend to use one time payment module as they believe a subscription would be more expensive for them. If we can change the module to make one time payments more expensive and subscriptions much cheaper, it could sway many riders to do so.

  1. Give trial subscriptions

Allow casual members to enjoy subscriptions for a month free to see what benefits they could be enjoying as annual members.