Leveraging Self-supervised Audio Representations for Data-Efficient Acoustic Scene Classification

Yiqiang Cai, Shengchen Li, Xi Shao

TL;DR

This work tackles data-efficient acoustic scene classification (ASC) by leveraging self-supervised audio representations learned on large unlabeled datasets. It uses BEATs, an SSL model pre-trained on AudioSet, as a feature extractor, fine-tunes it with both frozen and unfrozen strategies, and ensembles the resulting models to boost accuracy under limited labeled data. To meet low-complexity constraints, it distills knowledge from the BEATs teachers into a compact TF-SepNet-64 student, achieving a best average accuracy of 56.7% on DCASE 2024 Task 1. Overall, the approach shows that SSL pre-training, model ensembling, and teacher-student distillation can deliver strong ASC performance in data-scarce and resource-constrained settings.

Abstract

Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a powerful method for extracting features from unlabeled audio data, benefiting many downstream audio tasks. This paper proposes a data-efficient and low-complexity ASC system by leveraging self-supervised audio representations extracted from general-purpose audio datasets. We introduce BEATs, an audio SSL pre-trained model, to extract general representations from AudioSet. Extensive experiments demonstrate that these self-supervised audio representations help achieve high ASC accuracy with limited labeled fine-tuning data. Furthermore, we find that ensembling SSL models fine-tuned with different strategies yields a further performance improvement. To meet low-complexity requirements, we use knowledge distillation to transfer the self-supervised knowledge from large teacher models to an efficient student model. The experimental results suggest that the self-supervised teachers effectively improve the classification accuracy of the student model. Our best-performing system obtains an average accuracy of 56.7%.
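
The ensembling and distillation steps described in the abstract can be illustrated with a minimal sketch. It assumes the generic temperature-scaled formulation of knowledge distillation (a weighted sum of hard-label cross-entropy and KL divergence to averaged teacher probabilities); the paper's exact loss is not reproduced here. The function names, the `temperature` and `lam` parameters, and their default values are hypothetical choices for illustration, not values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, temperature=1.0):
    """Average the temperature-softened probabilities of several teachers.

    teacher_logits: list of per-teacher logit lists (one list per teacher).
    """
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n_teachers = len(probs)
    n_classes = len(probs[0])
    return [sum(p[i] for p in probs) / n_teachers for i in range(n_classes)]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, lam=0.5):
    """Assumed loss: lam * CE(student, label) + (1 - lam) * T^2 * KL(teacher || student).

    temperature and lam are illustrative defaults, not the paper's settings.
    """
    # Hard-label cross-entropy on the unsoftened student distribution.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])

    # KL divergence from the ensembled teacher targets to the softened student.
    q_teacher = ensemble_soft_targets(teacher_logits, temperature)
    p_soft = softmax(student_logits, temperature)
    kl = sum(q * math.log(q / p) for q, p in zip(q_teacher, p_soft) if q > 0)

    return lam * ce + (1.0 - lam) * temperature ** 2 * kl
```

With `lam=0.0` and a single teacher whose logits equal the student's, the KL term vanishes and the loss is zero; as the student's softened distribution moves away from the averaged teacher distribution, the loss grows.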

Paper Structure

This paper contains 16 sections, 2 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Proposed data-efficient and low-complexity ASC system. (a) Self-supervised pre-training of BEATs on AudioSet. (b) Fine-tuning the pre-trained BEATs on the ASC dataset. (c) Distilling knowledge from the fine-tuned BEATs to TF-SepNet-64. A snowflake icon indicates that the parameters of the corresponding part are frozen; a flame icon indicates that they are trainable.
  • Figure 2: t-SNE (van der Maaten and Hinton, 2008) visualization of acoustic scene features extracted by TF-SepNet-64 trained on the 5% subset. Left: without knowledge distillation. Right: with knowledge distilled from an ensemble of 3 BEATs teachers.