Device-Robust Acoustic Scene Classification via Impulse Response Augmentation

Tobias Morocutti; Florian Schmid; Khaled Koutini; Gerhard Widmer

TL;DR

This paper tackles the problem of distribution shifts in Acoustic Scene Classification caused by varying recording devices. It introduces Device Impulse Response (DIR) augmentation, convolving training audio with 66 DIRs to simulate diverse microphone characteristics, and evaluates its impact across CNN and transformer ASC models on TAU20/TAU22 datasets. DIR augmentation matches the performance of the state-of-the-art Freq-MixStyle when used alone and is highly complementary when combined with Freq-MixStyle, yielding state-of-the-art unseen-device performance in several configurations. The approach is simple, effective, and broadly beneficial across architectures, reducing device-specific performance gaps and enhancing practical robustness of ASC systems.

Abstract

The ability to generalize to a wide range of recording devices is a crucial performance factor for audio classification models. The characteristics of different types of microphones introduce distributional shifts in the digitized audio signals due to their varying frequency responses. If this domain shift is not taken into account during training, the model's performance could degrade severely when it is applied to signals recorded by unseen devices. In particular, training a model on audio signals recorded with a small number of different microphones can make generalization to unseen devices difficult. To tackle this problem, we convolve audio signals in the training set with pre-recorded device impulse responses (DIRs) to artificially increase the diversity of recording devices. We systematically study the effect of DIR augmentation on the task of Acoustic Scene Classification using CNNs and Audio Spectrogram Transformers. The results show that DIR augmentation in isolation performs similarly to the state-of-the-art method Freq-MixStyle. However, we also show that DIR augmentation and Freq-MixStyle are complementary, achieving a new state-of-the-art performance on signals recorded by devices unseen during training.
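
The core operation described above, convolving a training waveform with a device impulse response, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `dir_augment`, the truncation to the original length, and the peak rescaling are assumptions made for this example, and `p_dir` simply mirrors the DIR augmentation probability named in the caption of Figure 2.

```python
import numpy as np


def dir_augment(waveform, dirs, p_dir=0.5, rng=None):
    """Convolve a mono waveform with a randomly chosen device impulse response.

    waveform: 1-D float array holding the audio samples.
    dirs:     list of 1-D float arrays, one impulse response per device.
    p_dir:    probability of applying the augmentation to this waveform.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Skip the augmentation with probability 1 - p_dir.
    if len(dirs) == 0 or rng.random() >= p_dir:
        return waveform

    # Pick one impulse response uniformly at random.
    ir = dirs[rng.integers(len(dirs))]

    # Full linear convolution, truncated to the original length
    # (assumption: the tail added by the impulse response is discarded).
    out = np.convolve(waveform, ir, mode="full")[: len(waveform)]

    # Rescale to the original peak level so the augmentation changes the
    # spectral colouring, not the loudness (assumption of this sketch).
    peak = np.max(np.abs(out))
    if peak > 0:
        out = out * (np.max(np.abs(waveform)) / peak)
    return out
```

In a training pipeline, a function like this would typically be applied to each waveform before the spectrogram is computed, with `dirs` holding the pre-recorded impulse responses resampled to the sampling rate of the training audio.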

Paper Structure (19 sections, 2 figures, 1 table)

Introduction
Related Work
Recording Device Generalization
Impulse Response Augmentation
Device IR Augmentation
Experimental Setup
Architectures
Audio Preprocessing
Training Setup
Device Generalization Methods
Device Generalization Score
Results
Comparison to other Device Generalization Methods
Effect on different Models
CP-ResNet
...and 4 more sections

Figures (2)

Figure 1: Impulse response and frequency magnitude response of a Toshiba Type G recording device.
Figure 2: Parallel coordinate plot visualizing the relationship between the Freq-MixStyle probability ($p_{fms}$), the DIR augmentation probability ($p_{dir}$), accuracy on unseen devices and overall accuracy. Each line is an average over three experiments of running PaSST on TAU22 using DIR + FMS as the device generalization method.