Lightweight and Generalizable Acoustic Scene Representations via Contrastive Fine-Tuning and Distillation

Kuang Yuan, Yang Gao, Xilin Li, Xinhao Mei, Syavosh Zadissa, Tarun Pruthi, Saeed Bagheri Sereshki

TL;DR

ContrastASC addresses the transferability gap in acoustic scene classification on edge devices by learning generalizable embeddings through supervised contrastive fine-tuning and preserving relational structure via Contrastive Representation Distillation (CRD). The two-stage pipeline fine-tunes a BEATs teacher with mixup-aware Soft-SupCon and a cosine classifier, then distills the embedding relations to a compact CP-Mobile student using CRD with 2-layer projections and LayerNorm. Empirically, the approach maintains solid closed-set accuracy on TAU22 while substantially improving few-shot open-set generalization to unseen categories; CRD further enhances transferability across CP-Mobile variants, with performance gains scaling with model size. This yields lightweight, transferable ASC representations suitable for on-device deployment and real-world adaptation, with future work exploring teacher ensembles to further boost generalization.
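To make the contrastive fine-tuning stage more concrete, the following is a minimal NumPy sketch of the standard supervised contrastive (SupCon) objective, in which clips sharing a scene label are pulled together in embedding space and all others are pushed apart. It is not the mixup-aware Soft-SupCon variant named above, whose exact form is not given in this summary. The function name `supcon_loss`, the temperature value, and the toy embeddings are assumptions chosen for illustration.

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Standard SupCon loss over one batch (illustrative, hard labels only).

    embeddings: (N, D) array of raw embeddings.
    labels:     (N,) array-like of class ids.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Work on the unit sphere so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    # Exclude each anchor's similarity with itself.
    n = len(labels)
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable log-softmax over all other samples in the batch.
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))

    # Positives are other samples with the same label as the anchor.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive in this batch

    # Average log-probability of the positives, then average over anchors.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())

# Toy batch: two tight clusters in 2-D.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_aligned = supcon_loss(emb, [0, 0, 1, 1])     # labels match the clusters
loss_misaligned = supcon_loss(emb, [0, 1, 0, 1])  # labels cut across clusters
```

When labels agree with the geometric clusters the loss is small, and when they cut across clusters it is large; this is the pressure that shapes the embedding space during fine-tuning.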

Abstract

Acoustic scene classification (ASC) models on edge devices typically operate under fixed class assumptions, lacking the transferability needed for real-world applications that require adaptation to new or refined acoustic categories. We propose ContrastASC, which learns generalizable acoustic scene representations by structuring the embedding space to preserve semantic relationships between scenes, enabling adaptation to unseen categories without retraining. Our approach combines supervised contrastive fine-tuning of pre-trained models with contrastive representation distillation to transfer this structured knowledge to compact student models. Our evaluation shows that ContrastASC improves few-shot adaptation to unseen categories while maintaining strong closed-set performance.
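As an illustration of adapting to unseen categories without retraining, the sketch below classifies query clips by cosine similarity to class prototypes built from a few labeled support embeddings, with the encoder kept frozen. This summary does not state the paper's few-shot protocol, so nearest-prototype classification is an assumption here, as are the helper names `build_prototypes` and `predict` and the toy 2-D embeddings.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_prototypes(support_emb, support_labels):
    """Average the normalized support embeddings of each class (hypothetical helper)."""
    support_labels = np.asarray(support_labels)
    z = l2_normalize(support_emb)
    classes = np.unique(support_labels)
    protos = np.stack([z[support_labels == c].mean(axis=0) for c in classes])
    return classes, l2_normalize(protos)

def predict(query_emb, classes, prototypes):
    """Assign each query to the class whose prototype has the highest cosine similarity."""
    sims = l2_normalize(query_emb) @ prototypes.T
    return classes[sims.argmax(axis=1)]

# Two previously unseen scene classes, two support clips each (toy embeddings).
support = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_labels = ["park", "park", "metro", "metro"]
classes, prototypes = build_prototypes(support, support_labels)

queries = np.array([[0.8, 0.2], [0.2, 0.8]])
predicted = predict(queries, classes, prototypes)
```

Only the prototypes change when a new category is added, which is why a well-structured embedding space matters more than the original classifier head in this setting.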

Paper Structure

This paper contains 5 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of the proposed 2-stage training framework of ContrastASC in contrast to the conventional approach.
  • Figure 2: Closed-set and open-set accuracies across model sizes.