DinoSR: Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning

Alexander H. Liu, Heng-Jui Chang, Michael Auli, Wei-Ning Hsu, James R. Glass

TL;DR

DinoSR is a self-supervised speech representation learning method that combines masked input modeling, self-distillation, and online clustering to learn contextualized, discrete acoustic units. It uses a teacher–student Transformer in which the teacher is an exponential moving average (EMA) of the student. The teacher encodes unmasked speech, and online clustering over its top layers, with a separate codebook $\mathbf{E}^k$ of $V$ codewords per layer, turns its features into discrete targets for the masked student. The student is trained with a cross-entropy objective to predict codeword indices, so learning is end-to-end and needs no offline clustering. The paper reports state-of-the-art or competitive results on LibriSpeech ASR, ZeroSpeech acoustic-unit discovery, and SUPERB, and analyzes the learned units and the model's data efficiency. The work points to a scalable path from continuous speech to discrete tokens, with possible cross-language applicability and room for further scaling.
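The EMA teacher update mentioned above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: the function name `ema_update`, the dictionary-of-arrays parameter layout, and the decay value are all assumptions made for the example.

```python
import numpy as np

def ema_update(teacher, student, decay=0.999):
    """Return teacher parameters moved a small step toward the student.

    `teacher` and `student` are hypothetical dicts mapping parameter names
    to arrays of the same shape; `decay` is an assumed value.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }

# Toy usage: one parameter vector, with a deliberately low decay so the
# effect of a single step is visible.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher is only ever a running average of the student, it receives no gradients; only the student is updated by backpropagation.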

Abstract

We introduce self-distillation and online clustering for self-supervised speech representation learning (DinoSR), which combines masked language modeling, self-distillation, and online clustering. We show that these concepts complement each other and result in a strong representation learning model for speech. DinoSR first extracts contextualized embeddings from the input audio with a teacher network, then runs an online clustering system on the embeddings to yield a machine-discovered phone inventory, and finally uses the discretized tokens to guide a student network. We show that DinoSR surpasses previous state-of-the-art performance in several downstream tasks, and provide a detailed analysis of the model and the learned discrete units.
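The three steps described above (teacher embeddings, online clustering, discrete targets for the student) can be sketched for a single layer and a single codebook. This is a hedged illustration under stated assumptions, not the paper's implementation: the nearest-neighbor assignment with squared Euclidean distance, the moving-average form of the codebook update, the decay value, and all function names are assumptions chosen for the example.

```python
import numpy as np

def assign_codes(feats, codebook):
    """Index of the nearest codeword for each frame.

    feats: (T, D) teacher features; codebook: (V, D) codewords.
    """
    dists = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def update_codebook(sums, counts, feats, codes, decay=0.9):
    """Moving-average update of per-codeword feature sums and counts (assumed form)."""
    onehot = np.eye(sums.shape[0])[codes]                      # (T, V)
    sums = decay * sums + (1.0 - decay) * (onehot.T @ feats)   # (V, D)
    counts = decay * counts + (1.0 - decay) * onehot.sum(axis=0)
    codebook = sums / np.maximum(counts, 1e-8)[:, None]
    return sums, counts, codebook

def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over masked frames only; mask is a 0/1 float array."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return -(picked * mask).sum() / mask.sum()

# Toy usage: 3 frames, 2-dimensional features, 2 codewords.
feats = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0]])
codebook = np.array([[0.0, 0.0], [5.0, 5.0]])
codes = assign_codes(feats, codebook)            # targets from the teacher
sums, counts, new_codebook = update_codebook(
    codebook.copy(), np.ones(2), feats, codes, decay=0.5
)
student_logits = np.zeros((3, 2))                # stand-in for student outputs
mask = np.array([1.0, 0.0, 1.0])                 # frames masked in the student input
loss = masked_cross_entropy(student_logits, codes, mask)
```

In this sketch the codebook is updated without gradients, and the student's loss depends only on the integer targets, which is what lets the clustering run online alongside training.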

Paper Structure (28 sections, 7 equations, 11 figures, 7 tables, 1 algorithm)

Figures (11)

  • Figure 1: An overview of DinoSR. The teacher network is an exponential moving average of the student network and takes unmasked speech as input to extract target features. Online clustering is applied to multiple layers of the teacher, each with a separate codebook. The student network is trained to predict the corresponding clusters from masked input. Neither the teacher network nor the online clustering (shaded regions) requires gradients.
  • Figure 2: The trade-off between performance (WER on LibriSpeech dev-other) and data efficiency (hours of speech the model processed in total during pre-training) for different methods.
  • Figure 3: Varying codebook size $V$ and the number of codebooks $N$.
  • Figure 4: Visualization of the conditional probability $P(\text{phone}|\text{code})$ on the LibriSpeech dev set. The y-axis lists phones sorted by number of occurrences; the x-axis lists the 217 active codewords sorted by their most correlated phone. A larger version is provided in §\ref{subsec:vis}.
  • Figure 5: $P(\text{phone}|\text{code})$ from DinoSR with 217 codewords activated out of 256.
  • ...and 6 more figures
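
The quantity $P(\text{phone}|\text{code})$ shown in Figures 4 and 5 can be estimated from frame-level co-occurrence counts. The sketch below is one plausible way to compute it, assuming each frame already has an integer phone label and an integer codeword index; the function name, the toy data, and the column-sorting step are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def phone_given_code(phones, codes, num_phones, num_codes):
    """Estimate P(phone | code) from aligned per-frame labels.

    Returns a (num_phones, num_active_codes) matrix whose columns sum to 1,
    plus a boolean mask marking which codewords were used at least once.
    """
    counts = np.zeros((num_phones, num_codes))
    np.add.at(counts, (phones, codes), 1)        # co-occurrence counts
    active = counts.sum(axis=0) > 0              # drop never-used codewords
    active_counts = counts[:, active]
    probs = active_counts / active_counts.sum(axis=0, keepdims=True)
    return probs, active

# Toy usage: 6 frames, 3 phones, 4 codewords (codeword 2 is never used).
phones = np.array([0, 0, 1, 1, 1, 2])
codes = np.array([3, 3, 0, 0, 3, 1])
probs, active = phone_given_code(phones, codes, num_phones=3, num_codes=4)

# Order active codewords by their most probable phone, as on the x-axis.
column_order = np.argsort(probs.argmax(axis=0), kind="stable")
```

Counting the `True` entries of `active` gives the number of activated codewords, the analogue of the 217 out of 256 reported in the captions.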