Classifying Bicycle Infrastructure Using On-Bike Street-Level Images
Kal Backman, Ben Beck, Dana Kulić
TL;DR
The paper addresses the challenge of mapping city-wide cycling infrastructure using on-board street-level imagery. It introduces a hierarchical infrastructure classifier that processes image sequences with a ConvNeXt-V2 backbone, a latent encoder, a temporal self-attention module, and a decoder to output main and sub-class labels. The approach is trained on a large crowd-sourced Melbourne dataset with GPS-OSM labeling and demonstrates high accuracy (main class ~96%, sub-class ~95%) and robustness to extreme feature sparsity. It discusses labeling noise and limitations, and outlines avenues for extending to dynamic, crowd-sourced infrastructure maps to guide safer cycling networks.
Abstract
While cycling offers an attractive option for sustainable transportation, many potential cyclists are discouraged from taking it up by the lack of suitable and safe infrastructure. Efficiently mapping cycling infrastructure across entire cities is necessary to advance our understanding of how to provide connected networks of high-quality infrastructure. Therefore, we propose a system capable of classifying available cycling infrastructure from on-bike smartphone camera data. The system receives an image sequence as input and analyzes it temporally to account for the sparsity of signage. The model outputs cycling infrastructure class labels defined by a hierarchical classification system. Data is collected by participant cyclists, covering 7,006 km across the Greater Melbourne region, and is automatically labeled via a GPS and OpenStreetMap database matching algorithm. The proposed model achieved an accuracy of 95.38%, an increase in performance of 7.55% compared to the non-temporal model. The model demonstrated robustness to an extreme absence of image features, losing only 6.6% in accuracy when 90% of images were replaced with blank images. This work is the first to classify cycling infrastructure using only street-level imagery collected from bike-mounted mobile phone cameras, while demonstrating robustness to feature sparsity via long temporal sequence analysis.
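The feature-sparsity test mentioned above (replacing 90% of images with blank images) can be illustrated with a minimal sketch. The abstract does not say how the replaced images are chosen, so uniform random selection is an assumption here, and the function name `blank_out_frames`, its parameters, and the toy frame representation are all hypothetical, not taken from the paper.

```python
import random


def blank_out_frames(frames, drop_fraction, blank, seed=0):
    """Return a copy of `frames` with a fraction replaced by `blank`.

    Assumption: frames to blank are chosen uniformly at random without
    replacement; the paper's actual selection scheme is not stated.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded for repeatable experiments
    n_blank = round(len(frames) * drop_fraction)
    blank_idx = set(rng.sample(range(len(frames)), n_blank))
    return [blank if i in blank_idx else f for i, f in enumerate(frames)]


# Toy sequence: each "frame" is a 2x2 grayscale image as nested lists.
frames = [[[i, i], [i, i]] for i in range(1, 21)]
blank = [[0, 0], [0, 0]]

sparse = blank_out_frames(frames, 0.9, blank)
n_kept = sum(f is not blank for f in sparse)  # frames left untouched
```

With a 20-frame sequence and a 0.9 fraction, 18 frames are blanked and 2 are kept. Feeding the resulting sequence to a sequence classifier and comparing accuracy against the unmodified input is one way to reproduce this style of robustness check.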
