Can Synthetic Data Boost the Training of Deep Acoustic Vehicle Counting Networks?

Stefano Damiano, Luca Bondi, Shabnam Ghaffarzadegan, Andre Guntoro, Toon van Waterschoot

TL;DR

This paper tackles acoustic vehicle counting (AVC) when labeled real-world data is scarce by introducing a synthetic data generation framework and a strategy for mixing synthetic and real training data. The model is a CRNN that combines GCC-Phat features with a learnable Gabor filterbank and counts four categories (cars and commercial vehicles, each in two directions of travel) from a four-microphone linear array. The network is pre-trained on synthetic data generated with pyroadacoustics, Harmonoise, and Baldan engine models, then fine-tuned on a limited amount of real data, which reduces the amount of labeled real recordings needed. With 24 hours of labeled real data, counting accuracy improves from 63% to 88% for cars and from 86% to 94% for commercial vehicles.

Abstract

In the design of traffic monitoring solutions for optimizing the urban mobility infrastructure, acoustic vehicle counting models have received attention due to their cost effectiveness and energy efficiency. Although deep learning has proven effective for visual traffic monitoring, its use has not been thoroughly investigated in the audio domain, likely due to real-world data scarcity. In this work, we propose a novel approach to acoustic vehicle counting by developing: i) a traffic noise simulation framework to synthesize realistic vehicle pass-by events; ii) a strategy to mix synthetic and real data to train a deep-learning model for traffic counting. The proposed system is capable of simultaneously counting cars and commercial vehicles driving on a two-lane road, and identifying their direction of travel under moderate traffic density conditions. With only 24 hours of labeled real-world traffic noise, we are able to improve counting accuracy on real-world data from $63\%$ to $88\%$ for cars and from $86\%$ to $94\%$ for commercial vehicles.

Paper Structure (7 sections, 5 figures, 1 table)


Figures (5)

  • Figure 1: The proposed CRNN architecture takes as input the raw signal from a 4-channel linear microphone array and computes in parallel: i) the Generalized Cross-Correlation with Phase Transform (GCC-Phat) between pairs of channels; ii) a Learnable Filterbank with Gabor filters. Two convolutional encoders followed by Time-Distributed Multi-Layer Perceptrons (TD-MLP) compute spatial and semantic features, respectively. The concatenated features are processed by a further TD-MLP layer, followed by a Gated Recurrent Unit (GRU) and a fully connected (FC) layer to regress the number of vehicles per type (car, CV) and per direction (left-to-right, right-to-left).
  • Figure 2: Distribution of recorded events in four target categories averaged over $60\,\mathrm{s}$ audio segments.
  • Figure 3: Test accuracy of models trained on an increasing amount of real-world data ($\text{RW}$).
  • Figure 4: Test accuracy of models pre-trained on synthetic data and then fine-tuned ($\text{FT}$) on an increasing amount of real-world data.
  • Figure 5: Average accuracy (markers) and accuracy ranges (whiskers) obtained using a model pre-trained on synthetic data and fine-tuned (FT) on an increasing amount of real-world data, and a model trained from scratch using the same amount of real-world data (RW).
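
The Figure 1 caption mentions GCC-Phat computed between pairs of microphone channels as the spatial input feature. As a rough illustration of that one step, the following is a minimal NumPy sketch and not the authors' implementation: the function names `gcc_phat` and `pairwise_gcc_phat`, the `max_lag` parameter, the zero-padding length, and the choice of using all six channel pairs of a 4-channel array are assumptions made here for the example.

```python
from itertools import combinations

import numpy as np


def gcc_phat(x, y, max_lag, eps=1e-12):
    """GCC-PHAT of two 1-D signals (hypothetical helper).

    Returns (lags, cc) with lags in samples, from -max_lag to +max_lag.
    A peak at a positive lag means x is delayed with respect to y.
    """
    n = len(x) + len(y)                      # zero-pad to limit circular wrap-around
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    cross = X * np.conj(Y)                   # cross-power spectrum
    cross = cross / (np.abs(cross) + eps)    # phase transform: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)            # back to the lag domain
    max_lag = min(max_lag, n // 2 - 1)
    # Rotate so that index 0 corresponds to lag -max_lag, then crop.
    cc = np.roll(cc, max_lag)[: 2 * max_lag + 1]
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc


def pairwise_gcc_phat(frames, max_lag):
    """Stack GCC-PHAT for every channel pair of a (channels, samples) array.

    Using all pairs is an assumption of this sketch; the paper may use a subset.
    """
    pairs = list(combinations(range(frames.shape[0]), 2))
    feats = [gcc_phat(frames[i], frames[j], max_lag)[1] for i, j in pairs]
    return np.stack(feats)                   # shape: (num_pairs, 2 * max_lag + 1)
```

For a 4-channel frame, `pairwise_gcc_phat(frames, max_lag=20)` yields one correlation curve per channel pair, and the location of each curve's peak approximates the time difference of arrival between the two microphones, which is what makes the feature informative about direction of travel.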