ASPED: An Audio Dataset for Detecting Pedestrians
Pavan Seshadri, Chaeyeon Han, Bon-Woo Koo, Noah Posner, Subhrajit Guhathakurta, Alexander Lerch
TL;DR
ASPED introduces a novel audio-only pedestrian detection task and a large-scale dataset, collected on Georgia Tech campuses, to enable research in audio-based urban sensing. The authors benchmark three model families (VGGish, a CONV-based encoder, and the Audio Spectrogram Transformer, AST) on 1-second audio frames at four proximity radii, and apply techniques to mitigate data imbalance. Results indicate that audio cues can reveal pedestrian presence. AST and CONV perform best, but they still leave room for improvement in noisy urban environments. The dataset and baselines open avenues for noise-robust models, for extension to more environments, and for regression-based pedestrian counting.
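The task framing above (1-second frames, a proximity radius, imbalanced labels) can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes a mono signal, per-frame pedestrian counts already restricted to one chosen radius, and a binary "any pedestrian present" label. The helper names (`frame_audio`, `binary_labels`, `inverse_frequency_weights`) are hypothetical, and inverse-frequency class weighting is shown only as one common imbalance remedy, not necessarily the one used in ASPED.

```python
from typing import List, Sequence, Tuple


def frame_audio(samples: Sequence[float], sample_rate: int,
                frame_seconds: float = 1.0) -> List[List[float]]:
    """Split a mono signal into non-overlapping frames (hypothetical helper).

    Any trailing partial frame is dropped.
    """
    frame_len = int(sample_rate * frame_seconds)
    n_frames = len(samples) // frame_len
    return [list(samples[i * frame_len:(i + 1) * frame_len])
            for i in range(n_frames)]


def binary_labels(counts: Sequence[int]) -> List[int]:
    """Assumed labeling rule: 1 if at least one pedestrian was counted
    within the chosen radius during the frame, else 0."""
    return [1 if c > 0 else 0 for c in counts]


def inverse_frequency_weights(labels: Sequence[int]) -> Tuple[float, float]:
    """Per-class weights n / (2 * n_c) for (negative, positive).

    This is a generic remedy for class imbalance, assumed for illustration.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return n / (2 * n_neg), n / (2 * n_pos)


if __name__ == "__main__":
    # Toy data: 8 one-second frames, pedestrians present in only 2 of them.
    counts = [0, 0, 3, 0, 1, 0, 0, 0]
    labels = binary_labels(counts)
    print(labels)
    print(inverse_frequency_weights(labels))
```

With rare positive frames, the positive class receives the larger weight, so a loss function that accepts per-class weights penalizes missed pedestrians more heavily than false alarms.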
Abstract
We introduce the new audio analysis task of pedestrian detection and present a new large-scale dataset for this task. While the preliminary results prove the viability of using audio approaches for pedestrian detection, they also show that this challenging task cannot be easily solved with standard approaches.
