From Counting Stations to City-Wide Estimates: Data-Driven Bicycle Volume Extrapolation

Silke K. Kaiser, Nadja Klein, Lynn H. Kaack

TL;DR

The study demonstrates that city-wide bicycle-volume estimation is feasible by combining Berlin’s long-term counts with diverse open data sources (notably Strava crowdsourced data and infrastructure indicators) using Extreme Gradient Boosting (XGBoost). The authors evaluate daily and average annual daily bicycle volume (AADB) predictions via leave-one-station-out validation, find that Strava and infrastructure inputs drive most of the predictive power, and report substantial accuracy gains when short-term sample counts are incorporated (up to ~2/3 reduction in error). A street-level proof of concept captures temporal patterns but shows spatial inaccuracies that require further refinement, while simulations indicate that practical sampling strategies (e.g., 1-day counts) can markedly improve city-scale estimates. The framework offers a data-driven foundation for infrastructure planning and civil-society advocacy, and is readily reproducible with open data and standard ML tools.

Abstract

Shifting to cycling in urban areas reduces greenhouse gas emissions and improves public health. Street-level bicycle volume information would aid cities in planning targeted infrastructure improvements to encourage cycling and provide civil society with evidence to advocate for cyclists' needs. Yet, the data currently available to cities and citizens often only comes from sparsely located counting stations. This paper extrapolates bicycle volume beyond these few locations to estimate bicycle volume for the entire city of Berlin. We predict daily and average annual daily street-level bicycle volumes using machine-learning techniques and various public data sources. These include app-based crowdsourced data, infrastructure, bike-sharing, motorized traffic, socioeconomic indicators, weather, and holiday data. Our analysis reveals that the best-performing model is XGBoost, and crowdsourced cycling and infrastructure data are most important for the prediction. We further simulate how collecting short-term counts at predicted locations improves performance. By providing ten days of such sample counts for each predicted location to the model, we are able to halve the error and greatly reduce the variability in performance among predicted locations.
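The leave-one-station-out evaluation mentioned in the TL;DR can be illustrated with a minimal sketch. It is not the authors' pipeline: the station IDs and counts are invented toy data, the helper names are hypothetical, and a simple mean predictor stands in for XGBoost so that the example runs with the Python standard library alone.

```python
from statistics import mean

# Hypothetical toy data: daily counts per station (IDs and values are invented).
daily_counts = {
    "A": [100, 120, 110],
    "B": [200, 220, 210],
    "C": [300, 330, 300],
}

def aadb(counts):
    """Average annual daily bicycle volume: mean of the available daily counts."""
    return mean(counts)

def leave_one_station_out(data):
    """Yield (held_out_station, training_data) pairs, one per station."""
    for station in data:
        train = {s: c for s, c in data.items() if s != station}
        yield station, train

def evaluate(data):
    """Mean absolute error per held-out station for a placeholder model."""
    errors = {}
    for station, train in leave_one_station_out(data):
        # Placeholder model (assumption): predict the mean of all training counts.
        # The paper uses XGBoost with many features; this only shows the split.
        prediction = mean(c for counts in train.values() for c in counts)
        errors[station] = mean(abs(c - prediction) for c in data[station])
    return errors

errors = evaluate(daily_counts)
```

Each station is predicted by a model that never saw its counts, which mimics estimating volume at a street without a counter. The per-station errors in `errors` correspond to the kind of station-level variability the abstract refers to.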

Paper Structure

This paper contains 40 sections, 2 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Performance of the XGBoost model at the daily level and for average annual daily bicycle volume (AADB) estimates across the individual counting stations. The models in subfigures b) and d) were trained on ten days' worth of sample data and on the additional long-term counting stations (full-city model specification). Highlighted in all graphs are the counting stations whose error deviates from the mean by more than one standard deviation. The color coding and the ordering of the counting stations are the same across all subplots to ensure comparability. The counting station 'SEN' is left out of subplots b) and d) due to the small number of observations available.
  • Figure 2: Feature importance and proof of concept based on an XGBoost model trained on data of all available long-term counting stations.
  • Figure 3: Shown is the effect of collecting additional sample data at a new location to predict the daily volume of bicycles using XGBoost. In the left diagram, the models are trained on the full-city available data, both long-term data from other sites and sample data from the location in question; in the right diagram, the models are trained on location-specific sample data only. Best-performing specifications are depicted in gray in the other plot to allow for comparison. The error is the average over the 19 counting stations used, with 95% confidence intervals calculated from 10 repeated samples.
  • Figure 4: Location of the 12 short-term and 20 long-term counting stations within Berlin.
  • Figure 5: Descriptive statistics of the counting stations' measurements for the time period considered in this paper.
  • ...and 3 more figures