Leveraging Self-supervised Audio Representations for Data-Efficient Acoustic Scene Classification
Yiqiang Cai, Shengchen Li, Xi Shao
TL;DR
This work tackles data-efficient acoustic scene classification (ASC) by leveraging self-supervised audio representations learned on large unlabeled datasets. It uses BEATs, an SSL model pre-trained on AudioSet, as a feature extractor, fine-tunes it with both frozen and unfrozen strategies, and ensembles the resulting models to boost accuracy under limited labeled data. To meet low-complexity constraints, it distills knowledge from the BEATs teachers into a compact TF-SepNet-64 student, achieving a best average accuracy of 56.7% on DCASE 2024 Task 1. Overall, the approach demonstrates that SSL pre-training, model ensembling, and teacher-student distillation can deliver strong ASC performance in data-scarce and resource-constrained settings.
Abstract
Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a powerful method for extracting features from unlabeled audio data, benefiting many downstream audio tasks. This paper proposes a data-efficient and low-complexity ASC system that leverages self-supervised audio representations learned from general-purpose audio datasets. We use BEATs, an audio SSL model pre-trained on AudioSet, to extract general-purpose representations. Extensive experiments demonstrate that these self-supervised representations help achieve high ASC accuracy with limited labeled fine-tuning data. Furthermore, we find that ensembling SSL models fine-tuned with different strategies further improves performance. To meet low-complexity requirements, we use knowledge distillation to transfer the self-supervised knowledge from large teacher models to an efficient student model. The experimental results suggest that the self-supervised teachers effectively improve the classification accuracy of the student model. Our best-performing system obtains an average accuracy of 56.7%.
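The ensembling and teacher-student distillation steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the loss, so the sketch assumes the common formulation that mixes a hard-label cross-entropy with a temperature-scaled KL divergence, and assumes the ensemble is a plain average of teacher logits. The function names, the `temperature` and `alpha` values, and the choice of 10 classes are all illustrative assumptions.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis, with optional temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def ensemble_teacher_logits(teacher_logits):
    """Average the logits of several teachers (assumed ensembling rule)."""
    return np.mean(np.stack(teacher_logits, axis=0), axis=0)


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Hard-label cross-entropy mixed with soft-label KL divergence.

    temperature and alpha are hypothetical hyperparameters, not values
    reported in the paper.
    """
    n = student_logits.shape[0]
    eps = 1e-12

    # Hard-label term: cross-entropy against the ground-truth scene labels.
    log_p = np.log(softmax(student_logits) + eps)
    hard = -log_p[np.arange(n), labels].mean()

    # Soft-label term: KL(teacher || student) on temperature-softened outputs.
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature) + eps)
    kl = (p_teacher * (np.log(p_teacher + eps) - log_p_student)).sum(axis=-1).mean()

    # The temperature**2 factor keeps the soft-term gradients on a
    # comparable scale to the hard term.
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * kl


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, num_classes = 4, 10  # 10 classes is an assumption for illustration
    # Stand-ins for logits from two teachers fine-tuned with different strategies.
    teacher_frozen = rng.normal(size=(batch, num_classes))
    teacher_unfrozen = rng.normal(size=(batch, num_classes))
    student = rng.normal(size=(batch, num_classes))
    labels = rng.integers(0, num_classes, size=batch)

    teacher = ensemble_teacher_logits([teacher_frozen, teacher_unfrozen])
    print("distillation loss:", distillation_loss(student, teacher, labels))
```

In practice the teacher logits would come from the fine-tuned BEATs models and the student logits from the compact network; random arrays stand in for both here so that the sketch runs on its own.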
