SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations
Xiaoyu Yang, Yifan Yang, Zengrui Jin, Ziyun Cui, Wen Wu, Baoxiang Li, Chao Zhang, Phil Woodland
TL;DR
SPEAR is a unified self-supervised framework that bridges speech and general audio representation learning by distilling knowledge from two domain-specific SSL teachers, one for speech and one for general audio, into a single encoder. It applies multi-codebook vector quantisation (MVQ) to the continuous teacher representations to produce fine-grained discrete tokens, and trains the encoder with an asymmetric dual-domain pre-training objective plus a token-mixing augmentation that improves robustness in complex sound scenes. SPEAR sets a new state of the art on the SUPERB benchmark and is competitive on HEAR, while enabling substantial cross-domain transfer between speech and audio tasks. The results suggest that combining knowledge distillation with fine-grained token prediction and token mixing can yield a robust, general-purpose acoustic representation model for diverse audio-processing applications.
Abstract
Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism. Extensive experiments demonstrate that SPEAR consistently outperforms existing unified speech and audio models. SPEAR establishes a new state-of-the-art on the SUPERB benchmark, surpassing WavLM Large on 12 of 15 tasks, while achieving competitive performance on the HEAR benchmark. These results position SPEAR as a versatile foundation for general-purpose speech and audio representation learning. The code and pre-trained models will be released.
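To make the idea of MVQ tokens concrete, the following is a minimal sketch of how one continuous teacher frame could be mapped to one discrete token per codebook. It is an illustration under stated assumptions, not the SPEAR implementation: the abstract does not specify the quantiser, so the sketch assumes a simple residual-style scheme with fixed codebooks and nearest-neighbour search, and the names `mvq_encode`, `mvq_decode`, and the toy codebooks are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]


def nearest_index(x: Vector, codebook: Codebook) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def mvq_encode(frame: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Map one continuous frame to one token index per codebook.

    Assumption (not from the paper): residual-style quantisation, where
    each codebook encodes what the previous codebooks left unexplained.
    """
    residual = list(frame)
    tokens = []
    for codebook in codebooks:
        k = nearest_index(residual, codebook)
        tokens.append(k)
        # Subtract the chosen code so the next codebook refines the remainder.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return tokens


def mvq_decode(tokens: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct an approximate frame by summing the selected codes."""
    recon = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(tokens, codebooks):
        recon = [r + c for r, c in zip(recon, codebook[k])]
    return recon


# Hypothetical toy codebooks: a coarse one followed by a finer one.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
]
toy_frame = [1.2, 1.0]  # stands in for one teacher representation frame
toy_tokens = mvq_encode(toy_frame, toy_codebooks)
toy_recon = mvq_decode(toy_tokens, toy_codebooks)
```

In this toy setting each frame yields as many tokens as there are codebooks, and adding codebooks makes the reconstruction progressively finer. That is one way to read "fine-grained discrete tokens": a student encoder would be trained to predict all of these indices for each masked frame.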
