DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning
Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass
TL;DR
DinoSR approaches self-supervised speech representation learning by unifying masked input modeling, self-distillation, and online clustering to produce contextualized, discrete acoustic units. The framework uses a teacher–student Transformer pair in which the teacher is an exponential moving average (EMA) of the student. The teacher sees the unmasked input, and its top-layer outputs are clustered online, with one codebook $\mathbf{E}^k$ of $V$ codewords per clustered layer $k$; the resulting codeword indices serve as targets for the masked student. Training minimizes a cross-entropy objective for predicting these indices, so the model learns end to end without an offline clustering step. Empirical results show state-of-the-art or competitive performance on LibriSpeech ASR, ZeroSpeech acoustic-unit discovery, and SUPERB benchmarks, along with strong interpretability and data efficiency. The work points to a scalable path from continuous speech to discrete tokens and suggests cross-language applicability and further scaling as open directions.
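The two training-time ingredients named above, the EMA teacher update and the cross-entropy loss over codeword indices, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the dictionary-of-arrays parameter format, and the `decay` value are assumptions made for the example.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Return new teacher parameters: decay * teacher + (1 - decay) * student.

    `teacher` and `student` are hypothetical dicts mapping parameter names
    to NumPy arrays of matching shapes; `decay` is an assumed value.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

def cross_entropy(logits, targets):
    """Mean cross-entropy between student logits and codeword indices.

    logits:  (T, V) array, one row of scores over V codewords per masked frame.
    targets: (T,) integer array of codeword indices supplied by the teacher side.
    """
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```

With one codebook per clustered layer, a loss of this form would be computed once per layer and the results combined; how the per-layer terms are combined is not specified in this summary.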
Abstract
In this paper, we introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
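The middle step of that pipeline, online clustering of teacher embeddings into discrete tokens, can be illustrated with a small NumPy sketch. It assumes nearest-neighbor assignment under Euclidean distance and an exponential-moving-average codebook update; the function names, the running `sums` and `counts` statistics, and the `decay` value are hypothetical choices for this example and are not taken from the text above.

```python
import numpy as np

def assign_codewords(frames, codebook):
    """Index of the nearest codeword (squared Euclidean distance) per frame.

    frames:   (T, D) teacher embeddings for one layer.
    codebook: (V, D) current codewords for that layer.
    """
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def update_codebook(frames, ids, sums, counts, decay=0.9):
    """One EMA step on per-codeword running sums and counts (assumed scheme).

    sums:   (V, D) running sum of embeddings assigned to each codeword.
    counts: (V,)   running number of embeddings assigned to each codeword.
    Returns the refreshed codebook together with the updated statistics.
    """
    num_codewords = sums.shape[0]
    one_hot = np.eye(num_codewords)[ids]                 # (T, V) assignment matrix
    new_sums = decay * sums + (1.0 - decay) * (one_hot.T @ frames)
    new_counts = decay * counts + (1.0 - decay) * one_hot.sum(axis=0)
    # Guard against division by zero for codewords that were never used.
    codebook = new_sums / np.maximum(new_counts, 1e-8)[:, None]
    return codebook, new_sums, new_counts
```

In this sketch, the indices returned by `assign_codewords` play the role of the discretized tokens that guide the student, and the codebook is refreshed from teacher embeddings alone, so no gradient flows through the clustering step.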
