Context-Aware Automated Passenger Counting Data Denoising
Noëlie Cherrier, Baptiste Rérolle, Martin Graive, Amir Dib, Eglantine Schmitt
TL;DR
This work tackles the challenge of noisy Automatic Passenger Counting (APC) data for onboard occupancy estimation by proposing a context-aware denoising method framed as a constrained integer linear optimization problem. It integrates ticketing data and historical priors through a three-stage optimization that first removes outliers, then aligns the denoised counts with the observations, and finally selects the solution closest to the prior distributions. The approach yields robust occupancy estimates on real and simulated networks, with improved reliability over baselines and acceptable computation times, and may also apply to downstream tasks such as origin-destination (O/D) reconstruction. The results indicate that incorporating ticketing data and historical priors improves the consistency and interpretability of APC-derived ridership estimates, supporting data-driven decisions for transit operators.
Abstract
Reliable and accurate knowledge of ridership in public transportation networks is crucial for public transport operators and public authorities to understand how their network is used and to optimize the transport offering. Several techniques exist to estimate ridership, some of them automated. Among them, Automatic Passenger Counting (APC) systems detect passengers boarding and alighting the vehicle at each stop along its route. However, data from these systems are often noisy or even biased, resulting in under- or overestimation of onboard occupancy. In this work, we propose a denoising algorithm for APC data to improve their robustness and ease their analysis. The proposed approach consists of a constrained integer linear optimization that takes advantage of ticketing data and historical ridership data to further constrain and guide the optimization. Performance is assessed against other denoising methods on several public transportation networks in France, against manual counts available on one of these networks, and on simulated data.
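To make the problem concrete, the following is a minimal sketch of how onboard occupancy follows from boarding and alighting counts, and how noisy counts can be corrected under integer constraints. It is not the three-stage formulation proposed in this work: the function names, the example data, and the constraints themselves (non-negative load, empty vehicle at the end of the trip, boardings at least equal to ticket validations) are illustrative assumptions, and an exhaustive search over small corrections stands in for an integer linear programming solver.

```python
from itertools import product


def onboard_load(boardings, alightings):
    """Return the onboard load after each stop (cumulative boardings minus alightings)."""
    load, loads = 0, []
    for b, a in zip(boardings, alightings):
        load += b - a
        loads.append(load)
    return loads


def is_feasible(boardings, alightings, validations):
    """Check the assumed consistency constraints on one trip."""
    if min(boardings) < 0 or min(alightings) < 0:
        return False
    # Assumption: every ticket validation corresponds to at least one boarding.
    if any(b < v for b, v in zip(boardings, validations)):
        return False
    loads = onboard_load(boardings, alightings)
    # Assumption: the load is never negative and the vehicle ends the trip empty.
    return min(loads) >= 0 and loads[-1] == 0


def denoise(boardings, alightings, validations, radius=1):
    """Find the feasible integer counts closest (L1 distance) to the observed ones.

    Exhaustive search over corrections in [-radius, radius] per count;
    only practical for very short trips. Returns None if nothing is feasible.
    """
    observed = list(boardings) + list(alightings)
    n = len(boardings)
    best, best_cost = None, None
    for deltas in product(range(-radius, radius + 1), repeat=2 * n):
        cost = sum(abs(d) for d in deltas)
        if best_cost is not None and cost >= best_cost:
            continue  # cannot improve on the best solution found so far
        candidate = [x + d for x, d in zip(observed, deltas)]
        b, a = candidate[:n], candidate[n:]
        if is_feasible(b, a, validations):
            best, best_cost = (b, a), cost
    return best


# Hypothetical four-stop trip with noisy APC counts.
raw_boardings = [5, 3, 2, 0]
raw_alightings = [0, 2, 4, 5]
ticket_validations = [3, 4, 1, 0]

raw_loads = onboard_load(raw_boardings, raw_alightings)
clean_boardings, clean_alightings = denoise(
    raw_boardings, raw_alightings, ticket_validations
)
clean_loads = onboard_load(clean_boardings, clean_alightings)
```

In this made-up trip, the raw counts leave a negative load at the last stop and report fewer boardings than ticket validations at the second stop. Adding a single boarding at the second stop restores both constraints. A realistic implementation would replace the exhaustive search with an integer programming solver and add the outlier handling and historical priors described above.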
