R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces

Heng-Jui Chang, James Glass

TL;DR

R-Spin tackles the problem of extracting robust content representations from speech in a self-supervised setting. It extends Spin with noise-invariant training and pseudo-label learning based on Acoustic Pieces. Two views of each utterance, the original and a perturbed version, are quantized frame by frame with a learned codebook and trained with swapped predictions ($\mathcal{L}_{\text{Spin}}$); an auxiliary loss $\mathcal{L}_{\text{Aux}}$ guided by Acoustic Pieces prevents collapse, which makes it possible to fine-tune the entire model. The approach reports a 12x reduction in computation compared to the previous state of the art and improved robustness to distorted speech on phoneme recognition and ASR tasks, supported by analyses of discrete acoustic units and invariance properties. These findings show how discrete units can guide SSL training toward more speaker- and noise-invariant content representations, enabling efficient, domain-specific speech representation learning.

Abstract

This paper introduces Robust Spin (R-Spin), a data-efficient domain-specific self-supervision method for speaker and noise-invariant speech representations by learning discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves Spin's issues and enhances content representations by learning to predict acoustic pieces. R-Spin offers a 12X reduction in computational resources compared to previous state-of-the-art methods while outperforming them in severely distorted speech scenarios. This paper provides detailed analyses to show how discrete units contribute to speech encoder training and improving robustness in diverse acoustic environments.
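
The objective summarized in the TL;DR, swapped predictions over a codebook plus an auxiliary pseudo-label loss, can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes the per-frame codebook targets are already given as probability distributions (how they are computed is out of scope here), that the two losses are combined as a weighted sum, and that the auxiliary loss is averaged over both views. The function names and the `aux_weight` parameter are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def swapped_prediction_loss(logits_a, logits_b, targets_a, targets_b):
    """Each view predicts the other view's per-frame code assignment.

    logits_*:  (frames, codebook_size) scores for views A and B.
    targets_*: (frames, codebook_size) target distributions (assumed given).
    """
    loss_a = -(targets_b * log_softmax(logits_a)).sum(axis=-1).mean()
    loss_b = -(targets_a * log_softmax(logits_b)).sum(axis=-1).mean()
    return 0.5 * (loss_a + loss_b)


def pseudo_label_loss(logits, labels):
    """Frame-wise cross-entropy against integer pseudo-labels."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels].mean()


def total_loss(spin_logits_a, spin_logits_b, targets_a, targets_b,
               aux_logits_a, aux_logits_b, pseudo_labels, aux_weight=1.0):
    """Weighted sum of the two terms; aux_weight is a hypothetical knob."""
    spin = swapped_prediction_loss(spin_logits_a, spin_logits_b,
                                   targets_a, targets_b)
    aux = 0.5 * (pseudo_label_loss(aux_logits_a, pseudo_labels)
                 + pseudo_label_loss(aux_logits_b, pseudo_labels))
    return spin + aux_weight * aux


# Toy usage: 6 frames, codebook of 4 entries, 10 pseudo-label classes.
rng = np.random.default_rng(0)
frames, codebook_size, vocab = 6, 4, 10
targets = np.eye(codebook_size)[rng.integers(0, codebook_size, size=frames)]
labels = rng.integers(0, vocab, size=frames)
loss = total_loss(rng.normal(size=(frames, codebook_size)),
                  rng.normal(size=(frames, codebook_size)),
                  targets, targets,
                  rng.normal(size=(frames, vocab)),
                  rng.normal(size=(frames, vocab)),
                  labels)
print(float(loss))
```

Both views share the same pseudo-labels in this sketch because the perturbation is assumed to change the voice and noise conditions but not the spoken content.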

Paper Structure

This paper contains 41 sections, 4 equations, 16 figures, and 6 tables.

Figures (16)

  • Figure 1: The proposed R-Spin domain-specific self-supervision framework. The input utterance is perturbed into a different voice and distorted with random noise. Both the original and perturbed views are fed into an encoder initialized with an SSL pre-trained model. The model is optimized with the Speaker-invariant Clustering (Spin; Chang et al., 2023) objective ($\mathcal{L}_{\text{Spin}}$) and a frame-wise pseudo-label prediction loss ($\mathcal{L}_{\text{Aux}}$).
  • Figure 2: Phoneme error rates (PER) under different noise types and SNRs. The R-Spin$_{32,\,\text{AP40k}}$ model is used here.
  • Figure 3: t-SNE (van der Maaten and Hinton, 2008) visualization of the CNN features and of the layer with the lowest speaker identification rate, given the same clean utterance spoken by three speakers from TIMIT (Garofolo et al., 1993). Each color represents a speaker, and each label marks a frame with its corresponding phoneme label. The transcription is "Don't ask me to carry an oily rag like that." The silence frames are omitted for clarity.
  • Figure 4: Layer-wise perturbation invariability analyses with Linear CKA, where higher values indicate higher invariability to perturbations. The zeroth layer denotes the CNN feature extractor.
  • Figure 5: Layer-wise speaker identification accuracy.
  • ...and 11 more figures
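
The invariability measure named in Figure 4, Linear CKA, can be illustrated with a short sketch. The code below uses the standard feature-space formulation of linear centered kernel alignment (Kornblith et al., 2019). Whether the paper's analysis uses exactly this variant, and how frames are collected across utterances, is an assumption here; the function and variable names are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two feature matrices for the same n frames.

    x: (n, d1) features, e.g. one layer's outputs on clean speech.
    y: (n, d2) features, e.g. the same layer's outputs on perturbed speech.
    Returns a similarity in [0, 1]; higher means more similar representations.
    """
    # Center each feature dimension across frames.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Toy usage: compare features with a noisy copy of themselves.
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 16))
perturbed = clean + 0.1 * rng.normal(size=clean.shape)
print(float(linear_cka(clean, perturbed)))
```

Linear CKA is unchanged by orthogonal transformations and isotropic scaling of the features, which is why it suits layer-wise comparisons where the feature axes carry no fixed meaning.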