Learning Disentangled Speech Representations
Yusuf Brima, Ulf Krumnack, Simone Pika, Gunther Heidemann
TL;DR
This work tackles the lack of benchmark resources for disentangled representations in speech by introducing SynSpeech, a large-scale synthetic dataset with controlled variations in speaker identity, text content, gender, and speaking style, released in three versions to support experiments at different levels of complexity. It establishes a comprehensive evaluation framework that combines linear probing and supervised disentanglement metrics (including IRS, MIG, JEMMIG, and an Explicitness Score) to quantify the modularity, compactness, and explicitness of learned latent representations, and demonstrates it on a state-of-the-art RAVE-based model. The results show that simpler factors such as gender and speaking style are more amenable to disentanglement, while complex attributes like speaker identity remain partially entangled, especially under single-dimension encodings; aggregating dimensions improves predictive performance but reduces interpretability. SynSpeech thus provides a standardized, reproducible benchmark for comparing disentangled speech representation methods and guides future research toward more robust, interpretable, and transferable speech models.
Abstract
Disentangled representation learning in speech processing has lagged behind other domains, largely due to the lack of datasets with annotated generative factors for robust evaluation. To address this, we propose SynSpeech, a novel large-scale synthetic speech dataset specifically designed to enable research on disentangled speech representations. SynSpeech includes controlled variations in speaker identity, spoken text, and speaking style, with three dataset versions to support experimentation at different levels of complexity. In this study, we present a comprehensive framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics to assess the modularity, compactness, and informativeness of the representations learned by a state-of-the-art model. Using the RAVE model as a test case, we find that SynSpeech facilitates benchmarking across a range of factors: the model achieves promising disentanglement of simpler features like gender and speaking style, but struggles to isolate complex attributes like speaker identity. This benchmark dataset and evaluation framework fill a critical gap, supporting the development of more robust and interpretable speech representation learning methods.
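To make one of the supervised metrics concrete, the following is a minimal sketch of the Mutual Information Gap (MIG) on toy data. It is not the authors' implementation: the function names, the equal-width binning of latent values, and the toy latents and factor labels are all assumptions made for illustration. For each annotated factor, MIG takes the gap between the two latent dimensions that share the most mutual information with that factor and normalizes it by the factor's entropy, so a score near 1 suggests the factor is captured by a single dimension.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def discretize(values, n_bins=10):
    """Map continuous values to equal-width bin indices (an assumed choice)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant dims
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mig(latents, factors, n_bins=10):
    """Mutual Information Gap.

    latents: list of rows, each a list of at least two latent floats.
    factors: dict mapping factor name -> list of discrete labels
             (each factor must take at least two distinct values).
    Returns (mean gap over factors, per-factor gaps).
    """
    n_dims = len(latents[0])
    dims = [discretize([row[j] for row in latents], n_bins) for j in range(n_dims)]
    gaps = {}
    for name, labels in factors.items():
        # Mutual information of every latent dimension with this factor.
        mis = sorted((mutual_information(d, labels) for d in dims), reverse=True)
        # Gap between the two most informative dimensions, normalized by H(factor).
        gaps[name] = (mis[0] - mis[1]) / entropy(labels)
    return sum(gaps.values()) / len(gaps), gaps


# Hypothetical toy data: dimension 0 tracks gender, dimension 1 tracks style.
disentangled = [[0.0, 0.1], [0.0, 0.9], [1.0, 0.1], [1.0, 0.9]]
toy_factors = {"gender": [0, 0, 1, 1], "style": [0, 1, 0, 1]}
score_disentangled, _ = mig(disentangled, toy_factors)

# Hypothetical entangled case: both dimensions carry the same gender signal.
entangled = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0]]
score_entangled, _ = mig(entangled, {"gender": [0, 0, 1, 1]})

print(score_disentangled, score_entangled)
```

In this toy setup the disentangled latents should score close to 1 and the entangled latents close to 0. Real latent codes are continuous and noisy, so the result depends on the binning, which is the main design choice left open here.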
