Leveraging Self-supervised Audio Representations for Data-Efficient Acoustic Scene Classification

Yiqiang Cai; Shengchen Li; Xi Shao

TL;DR

This work tackles data-efficient acoustic scene classification (ASC) by leveraging self-supervised audio representations learned from large unlabeled datasets. It uses BEATs, an SSL model pre-trained on AudioSet, as a feature extractor, fine-tunes it with both frozen and unfrozen strategies, and ensembles the resulting models to boost accuracy when labeled data is limited. To meet low-complexity constraints, it distills knowledge from the BEATs teachers into a compact TF-SepNet-64 student, achieving a best average accuracy of $56.7\%$ on DCASE 2024 Task 1. Overall, the results indicate that SSL pre-training, model ensembling, and teacher-student distillation together deliver strong ASC performance in data-scarce and resource-constrained settings.

Abstract

Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a powerful method for extracting features from unlabeled audio data, benefiting many downstream audio tasks. This paper proposes a data-efficient and low-complexity ASC system that leverages self-supervised audio representations learned from general-purpose audio datasets. We adopt BEATs, an audio SSL model pre-trained on AudioSet, to extract general-purpose representations. Extensive experiments demonstrate that these self-supervised representations help achieve high ASC accuracy with limited labeled fine-tuning data. Furthermore, we find that ensembling SSL models fine-tuned with different strategies yields a further performance improvement. To meet low-complexity requirements, we use knowledge distillation to transfer the self-supervised knowledge from large teacher models to an efficient student model. The experimental results suggest that the self-supervised teachers effectively improve the classification accuracy of the student model. Our best-performing system obtains an average accuracy of 56.7%.
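The distillation step described in the abstract can be illustrated with a minimal sketch. It assumes a generic temperature-based distillation loss with averaged teacher probabilities; the paper's exact loss, temperature, and weighting are not given in this excerpt, so the function names and the defaults `temperature=2.0` and `alpha=0.5` below are hypothetical choices made for illustration only.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, temperature):
    """Average the softened class probabilities of several teachers."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_teachers = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_teachers for c in range(n_classes)]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and KL(teacher || student).

    temperature and alpha are assumed values, not taken from the paper.
    """
    # Hard-label term: standard cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-label term: KL divergence between the softened distributions.
    p_teacher = ensemble_soft_targets(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Scaling by T^2 keeps the soft term comparable in magnitude
    # to the hard term as the temperature changes.
    return alpha * hard + (1 - alpha) * temperature ** 2 * kl


# Example: three hypothetical teachers, one student, true class index 0.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [2.5, 0.0, -1.5]]
student = [1.0, 0.8, 0.1]
loss = distillation_loss(student, teachers, label=0)
```

When the student's logits match a single teacher's logits exactly, the KL term vanishes and only the weighted cross-entropy remains, which is a quick sanity check for any implementation of this kind of loss.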

Paper Structure (16 sections, 2 equations, 2 figures, 3 tables)

Introduction
Self-supervised Pre-training and Fine-tuning
BEATs
Frozen Fine-tuning
Unfrozen Fine-tuning
Ensemble Models
Knowledge Distillation with Self-supervised Teachers
TF-SepNet-64
Knowledge Distillation
Experimental Setup
Results
Performance of Fine-tuned BEATs
TF-SepNet-64 with BEATs Teachers
Ablation Study
Conclusion
...and 1 more sections

Figures (2)

Figure 1: Proposed data-efficient and low-complexity ASC system. (a) Self-supervised pre-training of BEATs on AudioSet. (b) Fine-tuning the pre-trained BEATs on the ASC dataset. (c) Distilling knowledge from the fine-tuned BEATs to TF-SepNet-64. The snowflake icon indicates that the parameters of the corresponding part are frozen; the flame icon indicates that they are trainable.
Figure 2: t-SNE (van der Maaten and Hinton, 2008) visualization of acoustic scene features extracted by TF-SepNet-64 trained on the 5% subset. Left: without knowledge distillation. Right: with knowledge distilled from the ensemble of 3 BEATs teachers.
