Can Synthetic Data Boost the Training of Deep Acoustic Vehicle Counting Networks?
Stefano Damiano, Luca Bondi, Shabnam Ghaffarzadegan, Andre Guntoro, Toon van Waterschoot
TL;DR
This paper tackles acoustic vehicle counting (AVC) under limited real-world data by introducing a synthetic data generation framework and a mixed-training strategy. The model combines a CRNN with GCC-PHAT features and a learnable Gabor filterbank, and uses a four-microphone array to count four categories: cars and commercial vehicles, each in two directions of travel. Synthetic data generated with pyroadacoustics, Harmonoise, and Baldan engine models is used to pre-train the network, which is then fine-tuned on a limited amount of real data, reducing the amount of labeled real recordings required. With 24 hours of labeled real data, counting accuracy on real-world data improves from 63% to 88% for cars and from 86% to 94% for commercial vehicles, which supports the practicality of synthetic pre-training for AVC.
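To make the GCC-PHAT feature mentioned above concrete, here is a minimal NumPy sketch of the textbook generalized cross-correlation with phase transform for one microphone pair. It is not the authors' implementation: the function name `gcc_phat`, the `max_lag` parameter, the zero-padding length, and the small stabilizing constant are assumptions made for illustration.

```python
import numpy as np

def gcc_phat(x, y, max_lag):
    """Textbook GCC-PHAT between two equal-rate signals (illustrative sketch).

    Returns lags in samples and the PHAT-weighted cross-correlation.
    A peak at a positive lag means x is delayed relative to y.
    """
    n = len(x) + len(y)              # zero-pad to avoid circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)           # cross-power spectrum
    cross /= np.abs(cross) + 1e-12   # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so that lags run from -max_lag to +max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc

# Usage on a synthetic pair: y is a noise burst, x is the same burst delayed by 5 samples.
rng = np.random.default_rng(0)
s = np.zeros(256)
s[:200] = rng.standard_normal(200)
lags, cc = gcc_phat(np.roll(s, 5), s, max_lag=16)
print(lags[np.argmax(cc)])  # 5 for this synthetic example
```

Because the PHAT weighting discards spectral magnitude, the correlation peak depends only on the inter-microphone time delay, which is what makes the feature informative about a vehicle's position and direction of travel relative to the array.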
Abstract
In the design of traffic monitoring solutions for optimizing the urban mobility infrastructure, acoustic vehicle counting models have received attention due to their cost effectiveness and energy efficiency. Although deep learning has proven effective for visual traffic monitoring, its use has not been thoroughly investigated in the audio domain, likely due to real-world data scarcity. In this work, we propose a novel approach to acoustic vehicle counting by developing: i) a traffic noise simulation framework to synthesize realistic vehicle pass-by events; ii) a strategy to mix synthetic and real data to train a deep-learning model for traffic counting. The proposed system is capable of simultaneously counting cars and commercial vehicles driving on a two-lane road, and identifying their direction of travel under moderate traffic density conditions. With only 24 hours of labeled real-world traffic noise, we are able to improve counting accuracy on real-world data from $63\%$ to $88\%$ for cars and from $86\%$ to $94\%$ for commercial vehicles.
