R-Spin: Efficient Speaker and Noise-invariant Representation Learning with Acoustic Pieces
Heng-Jui Chang, James Glass
TL;DR
R-Spin extracts robust content representations from speech in a self-supervised setting by extending Spin with noise-invariant training and pseudo-label learning based on Acoustic Pieces. Two perturbed views of the same utterance are quantized frame by frame with a shared codebook and trained with a swapped-prediction loss ($\mathcal{L}_{\text{Spin}}$). An auxiliary loss $\mathcal{L}_{\text{Aux}}$, guided by Acoustic Pieces, prevents collapse and thereby allows the entire model to be fine-tuned. The approach achieves a reported 12x reduction in computation compared to previous state-of-the-art methods and is more robust to distorted speech on phoneme recognition and ASR tasks. Analyses of the discrete acoustic units and of the invariance properties show how discrete units can guide SSL training toward more speaker- and noise-invariant content representations, enabling efficient, domain-specific training of robust speech processing systems.
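The two losses described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, temperatures, and `aux_weight` are hypothetical, the targets are plain sharpened softmax distributions (the original method may balance codeword assignments differently), and the auxiliary logits are assumed to come from some prediction head that is not shown.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def codebook_logits(frames, codebook, temperature):
    # Cosine similarity between frames (T, D) and codewords (K, D), scaled by temperature.
    f = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    c = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    return (f @ c.T) / temperature

def swapped_prediction_loss(view_a, view_b, codebook,
                            temperature=0.1, target_temperature=0.05):
    # Each view predicts the codeword distribution assigned to the other view.
    logp_a = log_softmax(codebook_logits(view_a, codebook, temperature))
    logp_b = log_softmax(codebook_logits(view_b, codebook, temperature))
    # ASSUMPTION: targets are sharpened softmax assignments, treated as constants.
    tgt_a = np.exp(log_softmax(codebook_logits(view_a, codebook, target_temperature)))
    tgt_b = np.exp(log_softmax(codebook_logits(view_b, codebook, target_temperature)))
    loss_ab = -(tgt_b * logp_a).sum(axis=1).mean()
    loss_ba = -(tgt_a * logp_b).sum(axis=1).mean()
    return 0.5 * (loss_ab + loss_ba)

def auxiliary_loss(aux_logits, piece_labels):
    # Frame-level cross-entropy against pseudo-labels (stand-ins for Acoustic Pieces).
    logp = log_softmax(aux_logits)
    return -logp[np.arange(len(piece_labels)), piece_labels].mean()

def total_loss(view_a, view_b, codebook, aux_logits, piece_labels, aux_weight=1.0):
    # ASSUMPTION: the two losses are combined as a simple weighted sum.
    return (swapped_prediction_loss(view_a, view_b, codebook)
            + aux_weight * auxiliary_loss(aux_logits, piece_labels))

# Toy usage with random features: T frames, D dims, K codewords, P pseudo-label classes.
rng = np.random.default_rng(0)
T, D, K, P = 50, 16, 8, 12
view_a = rng.normal(size=(T, D))
view_b = view_a + 0.1 * rng.normal(size=(T, D))  # second, perturbed view
codebook = rng.normal(size=(K, D))
aux_logits = rng.normal(size=(T, P))
piece_labels = rng.integers(0, P, size=T)
loss = total_loss(view_a, view_b, codebook, aux_logits, piece_labels)
print(float(loss))
```

In this sketch the swapped term rewards the encoder when both views of a frame map to the same codeword, while the auxiliary term ties every frame to an externally supplied label, which is one way to keep the representation from collapsing onto a few codewords.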
Abstract
This paper introduces Robust Spin (R-Spin), a data-efficient, domain-specific self-supervision method for speaker- and noise-invariant speech representations that learns discrete acoustic units with speaker-invariant clustering (Spin). R-Spin resolves Spin's issues and enhances content representations by learning to predict acoustic pieces. R-Spin offers a 12x reduction in computational resources compared to previous state-of-the-art methods while outperforming them in severely distorted speech scenarios. This paper provides detailed analyses to show how discrete units contribute to speech encoder training and improve robustness in diverse acoustic environments.
