SONAR: Self-Distilled Continual Pre-training for Domain Adaptive Audio Representation

Yizhou Zhang, Yuan Gao, Wangjin Zhou, Zicheng Yuan, Keisuke Imoto, Tatsuya Kawahara

TL;DR

This work tackles continual pre-training for domain-adaptive audio representation learning in the absence of full historical data. It proposes SONAR, a BEATs-based framework that combines Task-Relevant Stratified Sampling, Dual-source Self-Distillation Regularization (tokenizer- and model-level), and an Online Clustered Codebook that dynamically evolves the tokenizer. Across four diverse domains of unlabeled audio, SONAR adapts well while mitigating catastrophic forgetting: it outperforms BEATs and Direct Continual Pre-training on downstream tasks and maintains stable AudioSet performance. The approach offers a practical pathway for scalable, continual SSL in heterogeneous audio environments, enabling robust representations for speech, music, bioacoustics, and environmental sounds.

Abstract

Self-supervised learning (SSL) on large-scale datasets like AudioSet has become the dominant paradigm for audio representation learning. The continuous influx of new, unlabeled audio presents an opportunity to enrich these static representations, but the naive approach of retraining the model from scratch on all available data is computationally prohibitive and discards the valuable knowledge embedded in the previously trained model weights. To address this inefficiency, we propose SONAR (Self-distilled cONtinual pre-training for domain adaptive Audio Representation), a continual pre-training framework built upon BEATs. SONAR adapts to new domains while mitigating catastrophic forgetting by addressing three key challenges: designing a joint sampling strategy for new and prior data, applying regularization to balance specificity and generality, and dynamically expanding the tokenizer codebook to cover novel acoustic patterns. Experiments across four distinct domains demonstrate that our method achieves both high adaptability and robust resistance to forgetting.

Paper Structure

This paper contains 17 sections, 10 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Overview of the SONAR framework for continual self-supervised audio representation learning, as described in Section 3. The framework integrates task-relevant stratified sampling (Section 3.1), dual-source self-distillation (Section 3.2), and an online clustered codebook (Section 3.3) for dynamic adaptation to novel acoustic patterns. The approach enables efficient model adaptation across multiple domains while mitigating catastrophic forgetting.
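
The summary above names self-distillation as the regularizer but does not give the loss itself. As a rough illustration of the general idea only, the sketch below combines a masked-token prediction loss on new-domain data with a KL term that keeps the adapting (student) model close to a frozen copy of the pre-trained (teacher) model. All names here (`cross_entropy`, `kl_to_teacher`, `regularized_loss`, `weight`) are hypothetical, and the formulation is an assumption, not SONAR's actual objective, which applies distillation at both the tokenizer and model level.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(student_logits, target_ids):
    """Mean negative log-likelihood of the target token ids."""
    log_p = log_softmax(student_logits)
    return -log_p[np.arange(len(target_ids)), target_ids].mean()


def kl_to_teacher(student_logits, teacher_logits):
    """Mean KL(teacher || student); zero when both predict identically."""
    log_p_teacher = log_softmax(teacher_logits)
    log_p_student = log_softmax(student_logits)
    per_frame = (np.exp(log_p_teacher) * (log_p_teacher - log_p_student)).sum(axis=-1)
    return per_frame.mean()


def regularized_loss(student_logits, target_ids, teacher_logits, weight=1.0):
    """Assumed form: new-domain prediction loss plus a weighted distillation penalty."""
    return cross_entropy(student_logits, target_ids) + weight * kl_to_teacher(
        student_logits, teacher_logits
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(4, 8))                  # frozen pre-trained model outputs
    student = teacher + 0.5 * rng.normal(size=(4, 8))  # model being adapted
    targets = rng.integers(0, 8, size=4)               # tokenizer labels for masked frames
    print(regularized_loss(student, targets, teacher, weight=0.5))
```

In this sketch a larger `weight` favors retaining the pre-trained behavior (generality), while a smaller one lets the model follow the new domain more freely (specificity).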