Improving Acoustic Scene Classification in Low-Resource Conditions

Zhi Chen, Yun-Fei Shao, Yong Ma, Mingsheng Wei, Le Zhang, Wei-Qiang Zhang

TL;DR

This work addresses acoustic scene classification (ASC) under low-resource conditions and cross-device variability. It proposes DS-FlexiNet, a lightweight architecture that combines depthwise separable convolutions with ResNet-like residual connections. Training is augmented with Quantization Aware Training (QAT), Auto Device Impulse Response (ADIR), and Freq-MixStyle (FMS), plus Knowledge Distillation (KD) from twelve teacher models. Key innovations include Residual Normalization, $\mathrm{RN}(x) = \lambda \cdot x + \mathrm{IN}(x)$, to capture device-specific features; frequency-based augmentation via FMS; energy-aware device impulse response (DIR) simulation via ADIR; and a soft-teacher fusion scheme that combines logits across teachers. Experimental results on the TAU Urban Acoustic Scenes 2022 Mobile Development dataset demonstrate improved robustness and efficiency. The approach enables robust ASC deployment on mobile and embedded devices by addressing inter-device variation and resource constraints.
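
The Residual Normalization formula quoted above can be illustrated with a minimal sketch. It assumes that IN denotes instance normalization (zero mean and unit variance over one channel of one sample) and that the input is a single-channel frequency-by-time map stored as nested lists. The function names, the value of lam, and the toy input are illustrative assumptions, not the authors' implementation.

```python
import math


def instance_norm(feature_map, eps=1e-5):
    """Normalize one channel of one sample to zero mean and unit variance.

    Assumption: IN in the summary means plain instance normalization
    computed over all frequency and time positions of the map.
    """
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in feature_map]


def residual_norm(feature_map, lam=1.0, eps=1e-5):
    """Compute RN(x) = lam * x + IN(x) element-wise."""
    normed = instance_norm(feature_map, eps)
    return [
        [lam * v + n for v, n in zip(row, normed_row)]
        for row, normed_row in zip(feature_map, normed)
    ]


# Toy 2 x 3 (frequency x time) map; the values are made up for illustration.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = residual_norm(x, lam=0.5)
```

The lam * x term is an identity shortcut: with lam = 0 the layer reduces to pure instance normalization, while larger values let more of the unnormalized signal through.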

Abstract

Acoustic Scene Classification (ASC) identifies an environment based on an audio signal. This paper explores ASC in low-resource conditions and proposes a novel model, DS-FlexiNet, which combines depthwise separable convolutions from MobileNetV2 with ResNet-inspired residual connections for a balance of efficiency and accuracy. To address hardware limitations and device heterogeneity, DS-FlexiNet employs Quantization Aware Training (QAT) for model compression, together with data augmentation methods such as Auto Device Impulse Response (ADIR) and Freq-MixStyle (FMS) to improve cross-device generalization. Knowledge Distillation (KD) from twelve teacher models further enhances performance on unseen devices. The architecture includes a custom Residual Normalization layer to handle domain differences across devices, and depthwise separable convolutions reduce computational overhead without sacrificing feature representation. Experimental results show that DS-FlexiNet excels in both adaptability and performance under resource-constrained conditions.
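
The summary says only that logits are combined across teachers for distillation; the fusion rule itself is not given. As a hedged sketch, the code below assumes the simplest choice (a class-wise mean of teacher logits) and the common temperature-softened KL distillation loss with the usual temperature-squared scaling. All function names, the temperature, and the example logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_teacher_logits(teacher_logits):
    """Average logits class-wise across teachers.

    Assumption: a plain mean; the paper's actual fusion may differ.
    """
    count = len(teacher_logits)
    return [sum(per_class) / count for per_class in zip(*teacher_logits)]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(fuse_teacher_logits(teacher_logits), temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Three hypothetical teachers over three classes (the paper uses twelve).
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [2.5, 0.0, -1.5]]
student = [1.0, 0.8, -0.2]
loss = kd_loss(student, teachers)
```

In practice this term would be weighted against the ordinary cross-entropy loss on the ground-truth labels; that weighting is left out here because the summary does not specify it.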

Paper Structure

This paper contains 19 sections, 5 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Network Architecture
  • Figure 2: Energy Distribution Analysis of Audio in the TAU22 Dataset.
  • Figure 3: Confusion matrices classified by different dimensions. Abbreviations: MS: metro_station, PS: public_square, SM: shopping_mall, SP: street_pedestrian, ST: street_traffic. The comparison experiment is based on the sm4 model.