Improving Acoustic Scene Classification in Low-Resource Conditions

Zhi Chen; Yun-Fei Shao; Yong Ma; Mingsheng Wei; Le Zhang; Wei-Qiang Zhang

TL;DR

This work addresses acoustic scene classification (ASC) under low-resource conditions and cross-device variability. It proposes DS-FlexiNet, a lightweight architecture that combines depthwise separable convolutions with ResNet-like residual connections. Training is augmented with Quantization Aware Training (QAT), Auto Device Impulse Response (ADIR), and Freq-MixStyle (FMS), plus Knowledge Distillation (KD) from twelve teacher models. Key innovations include Residual Normalization, $RN(x) = \lambda \cdot x + IN(x)$, where $IN$ denotes instance normalization and $\lambda$ weights the identity path, to capture device-specific features; frequency-based augmentation via FMS; energy-aware device impulse response (DIR) simulation via ADIR; and a soft-teacher fusion scheme that combines logits across teachers. Experimental results on the TAU Urban Acoustic Scene 2022 Mobile Development dataset demonstrate improved robustness and efficiency. The approach supports ASC deployment on mobile and embedded devices, where inter-device variation and resource constraints are the main obstacles.
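The Residual Normalization formula RN(x) = λ·x + IN(x) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-channel feature map stored as a list of rows, assumes normalization statistics are taken over the whole map (the source does not state which axes are used), and uses hypothetical names (`instance_norm`, `residual_norm`, `lam`, `eps`) and an arbitrary default for λ.

```python
import math


def instance_norm(x, eps=1e-5):
    """Normalize one feature map (list of rows) to zero mean and unit variance.

    Assumption: statistics are computed over the whole map; the paper may
    normalize along a different axis (for example, per frequency bin).
    """
    values = [v for row in x for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in x]


def residual_norm(x, lam=0.1):
    """Compute RN(x) = lam * x + IN(x) element-wise.

    lam is the weight on the identity path; 0.1 is an illustrative value,
    not one taken from the paper.
    """
    normed = instance_norm(x)
    return [
        [lam * v + n for v, n in zip(row, norm_row)]
        for row, norm_row in zip(x, normed)
    ]


if __name__ == "__main__":
    feature_map = [[1.0, 2.0], [3.0, 4.0]]
    print(residual_norm(feature_map, lam=0.5))
```

With `lam=0` the layer reduces to plain instance normalization. A larger `lam` passes more of the unnormalized input through, which is how the identity term can retain information that normalization alone would remove.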

Abstract

Acoustic Scene Classification (ASC) identifies an environment based on an audio signal. This paper explores ASC in low-resource conditions and proposes a novel model, DS-FlexiNet, which combines depthwise separable convolutions from MobileNetV2 with ResNet-inspired residual connections to balance efficiency and accuracy. To address hardware limitations and device heterogeneity, DS-FlexiNet employs Quantization Aware Training (QAT) for model compression and data augmentation methods such as Auto Device Impulse Response (ADIR) and Freq-MixStyle (FMS) to improve cross-device generalization. Knowledge Distillation (KD) from twelve teacher models further enhances performance on unseen devices. The architecture includes a custom Residual Normalization layer to handle domain differences across devices, while depthwise separable convolutions reduce computational overhead without sacrificing feature representation. Experimental results show that DS-FlexiNet performs well in both adaptability and accuracy under resource-constrained conditions.
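The claim that depthwise separable convolutions reduce computational overhead can be made concrete by counting weights. The sketch below uses the standard parameter-count formulas for a 2D convolution and its depthwise separable factorization, ignoring biases. The layer sizes (3x3 kernel, 32 input channels, 64 output channels) are illustrative assumptions, not DS-FlexiNet's actual configuration, and the function names are hypothetical.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    # Illustrative layer sizes, not taken from the paper.
    k, c_in, c_out = 3, 32, 64
    full = standard_conv_params(k, c_in, c_out)             # 18432
    separable = depthwise_separable_params(k, c_in, c_out)  # 2336
    print(full, separable, round(full / separable, 2))
```

For this example layer the factorized form needs roughly 7.9 times fewer weights, and the saving grows with the number of output channels.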
