Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction

Jin Jie Sean Yeo; Ee-Leng Tan; Jisheng Bai; Santi Peksi; Woon-Seng Gan

TL;DR

The paper tackles data-efficient acoustic scene classification (ASC) under tight model-parameter constraints for DCASE 2024 Task 1. It proposes three systems: an optimized N-Base Channel Baseline (N-BCBL) for small training splits, a Knowledge Distillation Ensemble (KD-Ensemble) combining multi-teacher predictions, and a Teacher-Focused Student (TFS) that uses FocusNet, guided by a KD-trained teacher, to emphasize confusing classes. With careful preprocessing, augmentation (Frequency-MixStyle, Mixup, and device impulse response (DIR) augmentation), and selective channel scaling, the approach performs strongly across the splits of the TAU Urban Acoustic Scene 2022 Mobile dataset; KD-Ensemble leads on every split except the largest, where TFS leads. The work shows that teacher-student frameworks and targeted attention to confusing classes can yield substantial gains in data-limited ASC, and it suggests that ensembling KD and TFS models under strict resource constraints could improve results further.

Abstract

In this technical report, we describe the SNTL-NTU team's submission for Task 1, Data-Efficient Low-Complexity Acoustic Scene Classification, of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For small training splits, we explore lowering the complexity of the provided baseline model by reducing the number of base channels, and we introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing-class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained at the original sampling rate of 44.1 kHz. We use Knowledge Distillation to distill the ensemble model into the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded highest average test accuracies of (62.21, 59.82, 56.81, 53.03, 47.97)% on the (100, 50, 25, 10, 5)% splits, respectively, across the three systems.
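
To make the mixup augmentation mentioned in the abstract concrete, here is a minimal sketch in plain Python. It follows the standard mixup formulation (a convex combination of two inputs and of their one-hot labels, weighted by a Beta-distributed coefficient) and is not taken from the report itself. The function name `mixup`, the value `alpha=0.3`, and the toy feature vectors are assumptions made only for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.3, rng=random):
    """Blend two examples and their one-hot labels (standard mixup).

    alpha=0.3 is an illustrative assumption, not a value from the report.
    """
    # Mixing weight drawn from Beta(alpha, alpha); always lies in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the inputs (e.g. flattened spectrogram bins).
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    # The same combination applied to the labels yields a soft target.
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Toy usage with two 2-dimensional "features" and two classes.
rng = random.Random(0)
x_mix, y_mix, lam = mixup([1.0, 2.0], [1.0, 0.0], [3.0, 4.0], [0.0, 1.0], rng=rng)
```

Because the label is mixed with the same weight as the input, the soft target still sums to one, so it can be used directly with a cross-entropy loss that accepts probability targets.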

Paper Structure (16 sections, 8 equations, 3 figures, 3 tables)

INTRODUCTION
DATA PREPROCESSING AND AUGMENTATION
Preprocessing
Frequency-MixStyle, Device Impulse Response Augmentation, and Mixup
Channel Scaling
SUBMITTED MODELS
N-Base Channel Baseline Model
Knowledge Distillation Ensemble Model
Teacher-Focused Student Model
Model Architecture
TEACHER-STUDENT TRAINING METHODS
Knowledge Distillation
FocusNet
Results
CONCLUSION
...and 1 more sections

Figures (3)

Figure 1: CPM blocks: (1) Transition Block (input channels $\neq$ output channels), (2) Standard Block, (3) Spatial Downsampling Block
Figure 2: Class-wise Accuracy of Student models and Baseline for the 100% split
Figure 3: Class-wise Accuracy of student models and Baseline on various splits: (a) 5%, (b) 10%, (c) 25%, (d) 50%
