Toronto Bike-Share Data Analysis

shirleyzhang
10 min read · Apr 10, 2021

As the final course project for CIV1498: Introduction to Data Science for Civil and Mineral Engineers, we analyzed the bike-share data in the City of Toronto from 2017 to 2020. First, we performed data cleaning, exploratory analysis, and modelling to predict the hourly ridership demand. Then, we used the insights from historic data to inform expansion of the bike-share program.

The Datasets

The historical ridership data is provided by Bike Share Toronto and contains data from 2017, 2018, 2019, and 2020. Each entry represents a trip and contains the following information:

  • Trip Id: unique identifier for each trip
  • Subscription Id: user membership/subscription id
  • Trip Duration: duration of the trip measured in seconds
  • Start Station Id: station ID where the trip started
  • Start Time: time when the trip started
  • Start Station Name: name of the station where the trip started
  • End Station Id: station ID where the trip ended
  • End Time: time when the trip ended
  • End Station Name: name of the station where the trip ended
  • Bike Id: unique identifier of each bicycle in circulation
  • User Type: identifies whether the user has a membership or purchased a pass

The station information from the Bike Share API endpoint is in the following format:

The Toronto weather dataset contains historical weather in Toronto provided by the City of Toronto. The weather station is located at latitude 43.63 and longitude -79.4. Some of the more relevant fields are:

  • Date/Time: Date and time of the measurement
  • Temp (°C): Temperature in degrees Celsius
  • Rel Hum (%): Relative humidity as a percentage
  • Wind Spd (km/h): The speed of motion of air in kilometres per hour
  • Visibility (km): The distance at which objects of suitable size can be seen and identified
  • Weather: Observations of atmospheric phenomena, including the occurrence of weather and obstructions to vision

The Toronto neighborhoods data contains the location and geometry of neighborhood boundaries in Toronto.

Data Cleaning and Wrangling

To aggregate the different datasets into a single DataFrame for further analysis, a number of data cleaning steps are employed as described below:

  1. Column headers from the 2017–2018 ridership data are modified to match the 2019–2020 data.
  2. Numeric values are down-casted to reduce memory usage.
  3. Trip start and end times are unified to be in the same time zone (EST) and format (d/m/Y H:M:S) and are converted to pandas datetime objects.
  4. Ridership DataFrames are concatenated to create a single DataFrame for 2017–2020. Trips from the end of 2016 are discarded.
  5. Missing station Ids in the ridership data are filled in by fuzzy-matching the station names with the stations data.
  6. False starts (shorter than 60 seconds) and incomplete trips (more than 1.5 × IQR away from the median duration) are treated as outliers and removed.
  7. Weather data is merged with the ridership data.
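Steps 2, 3, and 6 above can be sketched in pandas as follows. This is a minimal illustration on a made-up five-row table, not the project code: the column names follow the field list earlier in this post, the sample values are invented, and the outlier fences (median ± 1.5 × IQR) are one reading of the rule in step 6.

```python
import pandas as pd

# Toy trips; all values are invented for illustration only.
trips = pd.DataFrame({
    "Trip Id": [1, 2, 3, 4, 5],
    "Trip Duration": [30, 600, 720, 900, 86400],  # seconds
    "Start Time": [
        "10/04/2019 08:15:00", "10/04/2019 08:20:00", "10/04/2019 09:00:00",
        "11/04/2019 17:30:00", "11/04/2019 18:00:00",
    ],
})

# Step 2: down-cast integers to the smallest integer dtype that fits.
for col in ["Trip Id", "Trip Duration"]:
    trips[col] = pd.to_numeric(trips[col], downcast="integer")

# Step 3: parse d/m/Y H:M:S strings into pandas datetimes.
trips["Start Time"] = pd.to_datetime(trips["Start Time"], format="%d/%m/%Y %H:%M:%S")

# Step 6a: drop false starts (shorter than 60 seconds).
trips = trips[trips["Trip Duration"] >= 60]

# Step 6b: drop trips more than 1.5 * IQR away from the median duration.
q1, q3 = trips["Trip Duration"].quantile([0.25, 0.75])
median = trips["Trip Duration"].median()
iqr = q3 - q1
keep = trips["Trip Duration"].between(median - 1.5 * iqr, median + 1.5 * iqr)
clean = trips[keep]
```

On this toy table, the 30-second false start and the day-long trip are both dropped. Time-zone handling is left out of the sketch.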

Finally, a single DataFrame is created containing ridership, station, and weather data from 2017 to 2020, with each row corresponding to a trip.
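The station-Id fill (step 5) and the weather merge (step 7) behind this table could look roughly like the sketch below. It is a minimal sketch under assumptions: it uses `difflib` from the Python standard library as a stand-in for whichever fuzzy matcher was actually used, the station names and Ids are hypothetical, and the weather rows are invented. The helper `match_station_id` and the `Hour` column are illustrative names.

```python
import difflib
import pandas as pd

# Hypothetical station lookup (name -> id); real values come from the stations data.
stations = {"Union Station": 7033, "Bay St / College St": 7006}

trips = pd.DataFrame({
    "Start Station Name": ["Union Station", "Bay St / College St (East Side)"],
    "Start Station Id": [7033, float("nan")],  # second id is missing
    "Start Time": pd.to_datetime(["2019-04-10 08:15:00", "2019-04-10 09:40:00"]),
})

def match_station_id(name, cutoff=0.6):
    """Return the id of the most similar known station name, or None if none is close."""
    hits = difflib.get_close_matches(name, list(stations), n=1, cutoff=cutoff)
    return stations[hits[0]] if hits else None

# Step 5: fill only the rows whose station id is missing.
missing = trips["Start Station Id"].isna()
trips.loc[missing, "Start Station Id"] = (
    trips.loc[missing, "Start Station Name"].map(match_station_id)
)

# Step 7: attach the hourly weather reading for the hour in which each trip started.
weather = pd.DataFrame({
    "Date/Time": pd.to_datetime(["2019-04-10 08:00:00", "2019-04-10 09:00:00"]),
    "Temp (°C)": [4.5, 6.0],  # invented readings
})
trips["Hour"] = trips["Start Time"].dt.floor("h")
merged = trips.merge(weather, left_on="Hour", right_on="Date/Time", how="left")
```

A left merge keeps every trip even when no weather reading exists for its hour, which makes gaps in the weather data visible as missing values instead of silently dropping trips.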

Now that we have a cleaned and consolidated DataFrame, we can aggregate the data for different time scales to explore different relationships. For hourly, daily, weekly, monthly, and yearly rides, the following parameters are averaged:

  • Total trips started
  • Total trips starting from or ending at each bike station
  • Total trips made by casual and annual riders
  • Trip duration
  • Weather conditions
  • Workday vs Weekend/Holiday
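As a rough illustration of this aggregation, the sketch below rolls a few invented trips up to hourly and daily tables. The `User Type` labels and the output column names (`trips`, `mean_duration`) are assumptions made for the example.

```python
import pandas as pd

# Invented trips for illustration only.
trips = pd.DataFrame({
    "Start Time": pd.to_datetime([
        "2019-07-01 08:05", "2019-07-01 08:40",
        "2019-07-01 09:10", "2019-07-02 17:30",
    ]),
    "Trip Duration": [600, 900, 1200, 300],  # seconds
    "User Type": ["Annual Member", "Casual Member", "Annual Member", "Annual Member"],
})

# Bucket each trip into the hour in which it started.
trips["Hour"] = trips["Start Time"].dt.floor("h")

# Hourly totals and mean duration (only hours with at least one trip appear).
hourly = trips.groupby("Hour").agg(
    trips=("Trip Duration", "count"),
    mean_duration=("Trip Duration", "mean"),
)

# Hourly trip counts split by rider type.
by_type = pd.crosstab(trips["Hour"], trips["User Type"])

# Coarser time scales can be derived from the hourly table.
daily = hourly["trips"].resample("D").sum()
```

The same pattern extends to weekly, monthly, and yearly tables by changing the resampling frequency.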

Two additional DataFrames are created for day of week and month of year as shown below:

Exploratory Analysis

Using the cleaned DataFrames, we can now explore the dataset to extract insights on factors affecting bike share usage. Here are some of our findings organized into temporal, seasonal, and geospatial analysis:

Temporal Analysis

There is a gradual increase in ridership each year, and casual ridership surpassed annual-member ridership in the second quarter of 2020, likely due to the impact of COVID-19.

The uncharacteristic drop in April of 2020 also shows the impact of lockdowns.

Ridership from both annual and casual members increases during warmer months.

People also tend to take longer rides on warmer days.

Annual members mostly use bike-share from Monday to Friday and on workdays in general, likely for commuting purposes.

Casual members seem to use bike share mainly for leisure purposes as shown in the large increase in their weekend and non-workday ridership.

During the “Free-Ride Wednesday” promotion, there is a large increase in casual ridership, but annual membership usage is unaffected because annual members do not benefit from the promotion.

More annual members use bike share during rush hours, again for commuting purposes, whereas casual ridership gradually increases throughout the afternoon and evening.

The Waterfront Communities-The Island neighborhood has the most departing and arriving rides, followed by Bay Street Corridor and Niagara.

Ridership decreased in April 2020 due to COVID-19 lockdowns imposed in late March.

People tend to take longer trips during warmer months.

Annual members tend to use bike share on workdays while casual riders tend to use bike share on non-workdays.

There is a large increase in casual ridership on Free-Ride Wednesdays while annual ridership is unaffected.

Annual members tend to use bikeshare during peak rush-hours while casual member usage gradually increases throughout the afternoon.

Weather Analysis

On average, hours with clear weather have the highest ridership.

Higher temperatures correlate with higher ridership, and higher wind speeds correlate with lower ridership.

Higher temperatures also correlate with longer trip durations, but wind speed does not correlate with trip duration.
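One simple way to quantify relationships like these is the Pearson correlation coefficient between hourly weather readings and hourly trip counts. The sketch below uses synthetic numbers chosen only to exercise the code, so the resulting values say nothing about the real data. The `pearson` helper is an illustrative name.

```python
import numpy as np

# Synthetic hourly observations, invented only to exercise the code.
temp_c = np.array([-5.0, 2.0, 10.0, 18.0, 25.0])
wind_kmh = np.array([35.0, 28.0, 20.0, 12.0, 8.0])
trips = np.array([40, 90, 210, 380, 520])

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

r_temp = pearson(temp_c, trips)    # positive for this synthetic data
r_wind = pearson(wind_kmh, trips)  # negative for this synthetic data
```

A coefficient near +1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.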

Geospatial Analysis

Waterfront Communities has the most departing & arriving rides.

Most bikeshare departures and arrivals are downtown and around TTC stations.

Most popular bikeshare stations are within 2 km of a subway station.
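A proximity check like this can be done with the haversine great-circle distance. The sketch below is one possible approach, not necessarily the method used in the project. The coordinates in the usage lines are illustrative points in Toronto, not actual station locations, and `near_subway` is a hypothetical helper.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def near_subway(bike_station, subway_stations, limit_km=2.0):
    """True if any subway station lies within limit_km of the bike station."""
    lat, lon = bike_station
    return any(
        haversine_km(lat, lon, s_lat, s_lon) <= limit_km
        for s_lat, s_lon in subway_stations
    )

# Illustrative coordinates only.
subway = [(43.6453, -79.3806)]
downtown_dock = (43.6450, -79.3800)
distant_dock = (43.7000, -79.3000)
is_downtown_near = near_subway(downtown_dock, subway)
is_distant_near = near_subway(distant_dock, subway)
```

For thousands of stations, a vectorized or spatial-index approach would be faster, but the pairwise version keeps the logic easy to follow.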

We also examined the distribution of ridership throughout the day, week, month and year. For daily variations, we noticed that ridership picks up in suburban neighborhoods earlier in the day and then moves into the downtown core.

For ridership distribution throughout the week, we noticed a decrease on the weekend in the downtown core.

Looking at distribution in different months, we noticed that ridership is more concentrated in the downtown core during colder months. However, in the summer, there is increased activity along the lakeshore trails.

Looking at distribution across the years, we observed that the lakeshore is especially busy in 2020 compared to previous years. This could be another effect of COVID-19 as people started working from home and sought out more socially-distanced activities.

Modelling

Now that we have an in-depth understanding of the ridership trends and patterns, we created a preliminary model to predict hourly demand. The first thing we did was to split the data for feature selection and hyperparameter tuning. Because the data is a time series in chronological order, data points that are very close in time exhibit very similar behaviour. Therefore, we decided to use the first 20 days of each month to train our model, validate it with the next 5 days of each month, and use the remaining days as our test dataset, in order to preserve the temporal trends and avoid data leakage.
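The day-of-month split described above can be sketched as follows. This is a minimal sketch on a two-month range of hourly timestamps generated purely for illustration, with a `timestamp` column name that is assumed.

```python
import pandas as pd

# Two months of hourly rows, generated only to illustrate the split.
hours = pd.DataFrame({
    "timestamp": pd.date_range("2019-01-01", "2019-02-28 23:00", freq="h")
})

day = hours["timestamp"].dt.day

train = hours[day <= 20]               # days 1-20 of each month
val = hours[(day > 20) & (day <= 25)]  # days 21-25 of each month
test = hours[day > 25]                 # day 26 to the end of each month
```

Each block is contiguous within a month, so neighbouring hours mostly stay in the same subset instead of being scattered across train and test as they would be with a random split.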

From our exploratory analysis, we have identified some quantitative and categorical features that have considerable influence on hourly ridership. We converted quantitative features to standard units and categorical features to dummy variables. These features include:

  • Year after 2017, Month, Day of week, Hour
  • Average temperature, Wind speed, Visibility, Weather, Holiday status
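A minimal sketch of this feature preparation, assuming pandas and invented sample rows: quantitative columns are converted to standard units (zero mean, unit standard deviation) and the categorical weather column becomes dummy variables. The column names are assumptions, and using the population standard deviation (`ddof=0`) is one possible choice.

```python
import pandas as pd

# Invented feature rows for illustration only.
X = pd.DataFrame({
    "temp_c": [2.0, 10.0, 18.0, 26.0],
    "wind_kmh": [30.0, 20.0, 10.0, 20.0],
    "weather": ["Snow", "Rain", "Clear", "Clear"],
    "is_holiday": [False, False, True, False],
})

numeric = ["temp_c", "wind_kmh"]

# Standard units: subtract the mean and divide by the standard deviation.
# In practice the mean and SD should be computed on the training rows only.
mu = X[numeric].mean()
sigma = X[numeric].std(ddof=0)
X[numeric] = (X[numeric] - mu) / sigma

# Dummy variables: one indicator column per category, dropping the first
# category to avoid a redundant column in linear models.
X = pd.get_dummies(X, columns=["weather"], drop_first=True)
```

With `drop_first=True`, the alphabetically first category ("Clear" here) becomes the baseline and gets no column of its own.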

Then, we selected a few regression models and evaluated them using three metrics. Both the mean absolute error (MAE) and the root mean squared error (RMSE) measure the average magnitude of the error, and RMSE penalizes large errors more heavily. The R2 score shows how much of the variation in the actual data is explained by our predictions.
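For reference, the three measures can be written out directly with NumPy. The sketch below uses the standard definitions. The sample values are invented, and the function names are illustrative.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error: average size of the errors."""
    return float(np.mean(np.abs(y - y_hat)))

def rmse(y, y_hat):
    """Root mean squared error: squaring weights large errors more heavily."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def r2(y, y_hat):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return float(1 - ss_res / ss_tot)

# Invented hourly trip counts and predictions.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 350.0])

scores = {"MAE": mae(y_true, y_pred), "RMSE": rmse(y_true, y_pred), "R2": r2(y_true, y_pred)}
```

In this toy case the single 50-trip miss pulls RMSE (about 26.5) above MAE (20), which shows how RMSE reacts more strongly to large errors.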

We also used these metrics to compare the relevance of the features we selected. We found that models that use only time-related features have a higher error and a lower R2 score than models that use both time- and weather-related features. This confirms our previous observations from exploratory analysis that weather has an influence on ridership and should be included in our feature selection.

To visualize the performance of our model, we created a residual plot that shows the deviation of our predictions from the actual data against the actual hourly trip counts. A perfect prediction would result in a horizontal line centered on 0. From our residual plot, we can see that our model performs relatively well for hours with fewer trips. However, for hours with larger trip counts, the actual values are generally greater than our predictions, meaning the model underestimates the busiest hours. This means our model is still too simple and does not capture the reasons why some hours are especially busy.
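The pattern described here can also be checked numerically. The sketch below assumes the residual is defined as actual minus predicted, so positive values mean under-prediction, and uses invented numbers that merely mimic the shape of the pattern.

```python
import numpy as np

# Invented hourly trip counts and predictions, for illustration only.
actual = np.array([20.0, 50.0, 120.0, 400.0, 650.0])
predicted = np.array([22.0, 48.0, 115.0, 330.0, 500.0])

# Residual > 0 means the model predicted fewer trips than actually occurred.
residuals = actual - predicted

# Compare average residuals for quieter and busier hours.
busy = actual >= np.median(actual)
mean_residual_quiet = float(residuals[~busy].mean())
mean_residual_busy = float(residuals[busy].mean())
```

A mean residual near zero for quiet hours together with a clearly positive one for busy hours is the numerical counterpart of a residual plot that fans upward at high trip counts.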

Now that we have our final model configuration, we can use it to help inform the City regarding their bike share expansion plan. To make future predictions, we suggest training the model in chronological order to capture the general trend of increasing popularity of the bike share program. For example, we can train the model on the 2017–2018 data and test on the 2019 data.

By comparing the predicted and actual trip count in 2019, we can see that our model performs relatively well for spring/fall/winter trips, but it tends to underestimate trip counts in the summer. This is consistent with our conclusions from the residual plot and shows that we might not be capturing seasonal trends very well in our model.

Summary and Recommendations

To improve future analysis, we suggest making a few corrections to the database. This includes filling in missing station and bike Ids as well as resolving the occasional mismatch between station name and station Ids.

We also suggest collecting additional data, such as when stations open and close, in order to scale station usage with availability and more accurately predict station popularity. Data on bike lane usage would also help to better capture routing patterns.

Here is a summary of our findings:

  1. Ridership has been increasing from 2017 to 2020.
  2. Ridership is highest in the summer.
  3. The system is used by both commuters and recreational riders.
  4. Commuters use the system mainly on weekdays, with peaks during the morning and evening rush hours.
  5. Recreational users use the system more frequently on weekends and holidays and during the summer.
  6. Ridership is concentrated within the downtown core on weekdays, but more evenly spread out on holidays and in the summer.
  7. Stations along the waterfront are especially popular on non-workdays.

Finally, we would like to make the following recommendations to the City of Toronto:

  1. System expansion is recommended due to increasing popularity.
  2. The City should continue to offer promotions such as Free Ride Wednesdays.
  3. We recommend densifying the downtown bike-share infrastructure because ridership is highest there. There is also high usage of stations along the waterfront. Additionally, more stations can be opened near existing transit hubs as well as new transit lines such as the Ontario Line and Eglinton Crosstown.
