Understanding Self-Supervised Learning of Speech Representation via Invariance and Redundancy Reduction
Yusuf Brima, Ulf Krumnack, Simone Pika, Gunther Heidemann
TL;DR
The paper investigates self-supervised speech representation learning using Barlow Twins (BT), focusing on invariance and redundancy reduction as core inductive priors. It provides an empirical analysis of BT's downstream transfer, potential for disentanglement, and the effects of loss-variant ablations, showing that BT affords strong transfer with high-quality upstream data but falls short on fine-grained factorization of latent factors. The study demonstrates generalization across tasks and domain shifts, and highlights that dataset quality can outperform sheer size in driving transfer, while disentanglement remains a key challenge. It concludes by outlining pathways to improve BT via perceptual priors and additional inductive biases to move toward more hierarchical, decoupled representations for speech.
Abstract
Self-supervised learning (SSL) has emerged as a promising paradigm for learning flexible speech representations from unlabeled data. By designing pretext tasks that exploit statistical regularities, SSL models can capture useful representations that are transferable to downstream tasks. This study provides an empirical analysis of Barlow Twins (BT), an SSL technique inspired by theories of redundancy reduction in human perception. On downstream tasks, BT representations accelerated learning and transferred across domains. However, limitations exist in disentangling key explanatory factors: redundancy reduction and invariance alone are insufficient for factorizing the learned latents into modular, compact, and informative codes. Our ablation study isolated gains from invariance constraints, but the gains were context-dependent. Overall, this work substantiates the potential of Barlow Twins for sample-efficient speech encoding, although challenges remain in achieving fully hierarchical representations. The analysis methodology and insights pave a path for extensions that incorporate additional inductive priors and perceptual principles to enhance the BT self-supervision framework.
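To make the two priors named in the abstract concrete, the following is a minimal, dependency-free sketch of the standard Barlow Twins objective as it is commonly formulated: the cross-correlation matrix between the embeddings of two augmented views is pushed toward the identity, with the diagonal acting as the invariance term and the off-diagonal as the redundancy-reduction term. The function names, the toy inputs, and the weight `lam=5e-3` are illustrative assumptions, not the configuration or code used in this study.

```python
import math
import random


def standardize(batch):
    """Standardize each embedding dimension to zero mean and unit variance over the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in batch]
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / n
        std = math.sqrt(var) + 1e-12  # small epsilon guards against constant dimensions
        for i in range(n):
            out[i][j] = (col[i] - mean) / std
    return out


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """Return (total, invariance, redundancy) for two batches of embeddings.

    z_a, z_b: lists of N embeddings (each a list of D floats), one per augmented view.
    lam: trade-off weight on the redundancy term (assumed value for illustration).
    """
    n, d = len(z_a), len(z_a[0])
    za, zb = standardize(z_a), standardize(z_b)
    invariance, redundancy = 0.0, 0.0
    for i in range(d):
        for j in range(d):
            # Entry (i, j) of the cross-correlation matrix between the two views.
            c_ij = sum(za[k][i] * zb[k][j] for k in range(n)) / n
            if i == j:
                invariance += (1.0 - c_ij) ** 2  # diagonal should approach 1
            else:
                redundancy += c_ij ** 2  # off-diagonal should approach 0
    return invariance + lam * redundancy, invariance, redundancy


# Toy usage: random vectors stand in for encoder outputs of two views.
rng = random.Random(0)
view_a = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(64)]
view_b = [[v + 0.1 * rng.gauss(0, 1) for v in row] for row in view_a]
total, inv, red = barlow_twins_loss(view_a, view_b)
print(f"total={total:.4f} invariance={inv:.4f} redundancy={red:.4f}")
```

Returning the two terms separately mirrors the kind of loss-variant ablation the abstract describes: dropping or reweighting either term is a one-line change in this sketch. Note that decorrelating dimensions in this way does not by itself guarantee that each dimension aligns with a single explanatory factor, which is consistent with the disentanglement limitation reported above.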
