Device-Robust Acoustic Scene Classification via Impulse Response Augmentation
Tobias Morocutti, Florian Schmid, Khaled Koutini, Gerhard Widmer
TL;DR
This paper addresses the distribution shifts in Acoustic Scene Classification (ASC) caused by varying recording devices. It introduces Device Impulse Response (DIR) augmentation, which convolves training audio with 66 DIRs to simulate diverse microphone characteristics, and evaluates its effect on CNN and transformer ASC models using the TAU20 and TAU22 datasets. Used alone, DIR augmentation performs similarly to the state-of-the-art method Freq-MixStyle. Combined with Freq-MixStyle, the two methods are complementary and reach state-of-the-art unseen-device performance in several configurations. The approach is simple, benefits both architecture families, and narrows the performance gap between devices seen and unseen during training.
Abstract
The ability to generalize to a wide range of recording devices is a crucial performance factor for audio classification models. The characteristics of different types of microphones introduce distributional shifts in the digitized audio signals due to their varying frequency responses. If this domain shift is not taken into account during training, the model's performance could degrade severely when it is applied to signals recorded by unseen devices. In particular, training a model on audio signals recorded with a small number of different microphones can make generalization to unseen devices difficult. To tackle this problem, we convolve audio signals in the training set with pre-recorded device impulse responses (DIRs) to artificially increase the diversity of recording devices. We systematically study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers. The results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle. However, we also show that DIR augmentation and Freq-MixStyle are complementary, achieving a new state-of-the-art performance on signals recorded by devices unseen during training.
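The core operation described in the abstract, convolving a training signal with a pre-recorded device impulse response, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `dir_augment`, the application probability `p`, and the truncation of the output to the input length are assumptions made for illustration.

```python
import numpy as np


def dir_augment(waveform, impulse_responses, p=0.5, rng=None):
    """Convolve a mono waveform with a randomly chosen device impulse response.

    waveform: 1-D float array holding one audio clip.
    impulse_responses: list of 1-D float arrays (a hypothetical DIR bank).
    p: probability of applying the augmentation (assumed hyperparameter).
    rng: optional np.random.Generator for reproducibility.
    """
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() >= p:
        return waveform  # leave this sample unaugmented
    # Pick one impulse response uniformly at random from the bank.
    ir = impulse_responses[rng.integers(len(impulse_responses))]
    # Full linear convolution, truncated so the clip keeps its original length.
    return np.convolve(waveform, ir, mode="full")[: len(waveform)]


# Usage: a unit impulse leaves the signal unchanged, a one-sample delay shifts it.
clip = np.arange(5, dtype=float)
same = dir_augment(clip, [np.array([1.0])], p=1.0)
shifted = dir_augment(clip, [np.array([0.0, 1.0])], p=1.0)
```

In a training pipeline, a function like this would typically be applied per sample before the spectrogram is computed. For long clips, an FFT-based convolution would be a faster substitute for `np.convolve`.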
