From Counting Stations to City-Wide Estimates: Data-Driven Bicycle Volume Extrapolation
Silke K. Kaiser, Nadja Klein, Lynn H. Kaack
TL;DR
The study demonstrates that city-wide bicycle-volume estimation is feasible by fusing Berlin’s long-term counting-station data with diverse open data sources (notably Strava crowdsourced data and infrastructure indicators) using Extreme Gradient Boosting (XGBoost). Under leave-one-station-out validation, the authors report robust predictions of daily and average annual daily bicycle (AADB) volumes, with Strava and infrastructure inputs contributing most of the predictive power. Incorporating short-term sample counts yields substantial accuracy gains (up to ~2/3 reduction in error). A street-level proof of concept captures temporal patterns but reveals spatial nuances that require further refinement, and simulations indicate that practical sampling strategies (e.g., 1-day counts) can markedly improve city-scale estimates. The framework offers a data-driven foundation for infrastructure planning and civil-society advocacy, and it can be reproduced with open data and standard ML tools.
Abstract
Shifting to cycling in urban areas reduces greenhouse gas emissions and improves public health. Street-level bicycle volume information would aid cities in planning targeted infrastructure improvements to encourage cycling and provide civil society with evidence to advocate for cyclists' needs. Yet, the data currently available to cities and citizens often comes only from sparsely located counting stations. This paper extrapolates bicycle volume beyond these few locations to estimate bicycle volume for the entire city of Berlin. We predict daily and average annual daily street-level bicycle volumes using machine-learning techniques and various public data sources. These include app-based crowdsourced data, infrastructure, bike-sharing, motorized traffic, socioeconomic indicators, weather, and holiday data. Our analysis reveals that the best-performing model is XGBoost and that crowdsourced cycling and infrastructure data are the most important inputs for the prediction. We further simulate how collecting short-term counts at predicted locations improves performance. By providing the model with ten days of such sample counts for each predicted location, we are able to halve the error and greatly reduce the variability in performance among predicted locations.
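To make the evaluation idea concrete, the following is a minimal sketch of leave-one-station-out validation: each counting station is held out in turn, a model is fitted on the remaining stations, and the error is measured on the held-out one. Everything in it is an assumption for illustration. The toy records, the single crowdsourced-trips feature, and the ratio-based stand-in model are hypothetical and deliberately replace the XGBoost model and the full feature set used in the paper.

```python
from statistics import mean

# Hypothetical toy records: (station_id, crowdsourced_trips, observed_daily_count).
# These values are invented for illustration and are not from the paper.
records = [
    ("A", 10, 100), ("A", 20, 210), ("A", 30, 290),
    ("B", 10, 120), ("B", 40, 390),
    ("C", 20, 180), ("C", 30, 310),
]


def fit_ratio(train):
    """Stand-in model (not XGBoost): one scaling factor from trips to counts."""
    return sum(y for _, _, y in train) / sum(x for _, x, _ in train)


def leave_one_station_out(rows):
    """Yield (held_out_station, train_rows, test_rows) once per station."""
    stations = sorted({station for station, _, _ in rows})
    for held_out in stations:
        train = [r for r in rows if r[0] != held_out]
        test = [r for r in rows if r[0] == held_out]
        yield held_out, train, test


def evaluate(rows):
    """Return the mean absolute error per held-out station."""
    errors = {}
    for station, train, test in leave_one_station_out(rows):
        ratio = fit_ratio(train)  # fitted without any data from `station`
        errors[station] = mean(abs(ratio * x - y) for _, x, y in test)
    return errors


station_errors = evaluate(records)
for station, err in station_errors.items():
    print(f"station {station}: MAE = {err:.1f}")
```

Splitting by station rather than by individual observation matters here because the goal is to predict at locations without a counter; a random split would let the model see other days from the same station and would likely overstate performance at unseen locations.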
