Learning Disentangled Speech Representations

Yusuf Brima; Ulf Krumnack; Simone Pika; Gunther Heidemann

TL;DR

This work tackles the lack of benchmark resources for disentangled representations in speech by introducing SynSpeech, a large-scale synthetic dataset with controlled variations in speaker identity, text content, gender, and speaking style, released in three versions to support experiments at different complexities. It establishes a comprehensive evaluation framework that combines linear probing and supervised disentanglement metrics (including IRS, MIG, JEMMIG, and an Explicitness Score) to quantify modularity, compactness, and explicitness of learned latent representations, demonstrated on a state-of-the-art RAVE-based model. The results show that simpler factors such as gender and speaking style are more amenable to disentanglement, while complex attributes like speaker identity remain partially entangled, especially under single-dimension encodings; aggregating dimensions improves predictive performance but reduces interpretability. SynSpeech thus provides a standardized, reproducible benchmark for comparing disentangled speech representation methods and guides future research toward more robust, interpretable, and transferable speech models.
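As a rough illustration of one of the metrics named above, the following NumPy sketch computes a Mutual Information Gap (MIG) per factor. It assumes the common definition: for each factor, take the two latent dimensions with the highest mutual information and divide their difference by the factor's entropy. The histogram binning, the function names, the bin count, and the toy data are all assumptions made for this example; none of it is the paper's implementation or a SynSpeech result.

```python
import numpy as np

def discrete_entropy(labels):
    """Plug-in Shannon entropy (in nats) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def discrete_mutual_info(a, b):
    """I(a; b) = H(a) + H(b) - H(a, b) for two discrete 1-D integer arrays."""
    joint = np.stack([a, b], axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    h_joint = float(-(p * np.log(p)).sum())
    return discrete_entropy(a) + discrete_entropy(b) - h_joint

def mutual_information_gap(latents, factors, n_bins=20):
    """Per-factor MIG.

    latents: (N, D) float array with D >= 2.
    factors: (N, K) integer array of ground-truth factor labels.
    Returns an array of K gaps, each roughly in [0, 1].
    """
    n, d = latents.shape
    # Discretize every latent dimension with equal-width histogram bins.
    binned = np.zeros((n, d), dtype=int)
    for j in range(d):
        edges = np.histogram_bin_edges(latents[:, j], bins=n_bins)
        binned[:, j] = np.digitize(latents[:, j], edges[1:-1])
    gaps = []
    for k in range(factors.shape[1]):
        mi = np.array([discrete_mutual_info(binned[:, j], factors[:, k])
                       for j in range(d)])
        top = np.sort(mi)[::-1]  # mutual information, largest first
        gaps.append((top[0] - top[1]) / discrete_entropy(factors[:, k]))
    return np.array(gaps)

# Toy data (hypothetical): gender sits in one dimension, style is smeared over two.
rng = np.random.default_rng(0)
n_samples = 2000
gender = rng.integers(0, 2, size=n_samples)
style = rng.integers(0, 4, size=n_samples)
factors = np.stack([gender, style], axis=1)

latents = rng.normal(size=(n_samples, 4))
latents[:, 0] = gender + 0.01 * rng.normal(size=n_samples)
latents[:, 1] = style + 0.5 * rng.normal(size=n_samples)
latents[:, 2] = style + 0.5 * rng.normal(size=n_samples)

mig = mutual_information_gap(latents, factors)
print("MIG (gender, style):", mig)
```

In this toy setup the factor held by a single dimension gets a large gap, while the factor shared by two dimensions gets a small one, which is the sense in which MIG rewards compact encodings.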

Abstract

Disentangled representation learning in speech processing has lagged behind other domains, largely due to the lack of datasets with annotated generative factors for robust evaluation. To address this, we propose SynSpeech, a novel large-scale synthetic speech dataset specifically designed to enable research on disentangled speech representations. SynSpeech includes controlled variations in speaker identity, spoken text, and speaking style, with three dataset versions to support experimentation at different levels of complexity. In this study, we present a comprehensive framework to evaluate disentangled representation learning techniques, applying both linear probing and established supervised disentanglement metrics to assess the modularity, compactness, and informativeness of the representations learned by a state-of-the-art model. Using the RAVE model as a test case, we find that SynSpeech facilitates benchmarking across a range of factors: the model achieves promising disentanglement of simpler features like gender and speaking style, while complex attributes like speaker identity remain difficult to isolate. Together, the benchmark dataset and evaluation framework fill a critical gap, supporting the development of more robust and interpretable speech representation learning methods.
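The linear-probing idea mentioned in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the probe is a least-squares one-hot linear classifier (a simple stand-in, not necessarily the probe used in the paper), the helper names are hypothetical, and the latents are toy data in which gender occupies one dimension while speaker identity is spread over three.

```python
import numpy as np

def fit_linear_probe(z_train, y_train, n_classes):
    """Least-squares linear classifier: regress one-hot labels on [z, 1]."""
    x = np.hstack([z_train, np.ones((len(z_train), 1))])
    targets = np.eye(n_classes)[y_train]
    weights, *_ = np.linalg.lstsq(x, targets, rcond=None)
    return weights

def probe_accuracy(weights, z_test, y_test):
    """Fraction of test points whose arg-max score matches the label."""
    x = np.hstack([z_test, np.ones((len(z_test), 1))])
    return float((np.argmax(x @ weights, axis=1) == y_test).mean())

def probe_dimensions(z, y, n_classes, train_fraction=0.8, seed=0):
    """Probe each latent dimension alone, then all dimensions together."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(z))
    cut = int(train_fraction * len(z))
    train, test = order[:cut], order[cut:]
    per_dim = []
    for j in range(z.shape[1]):
        w = fit_linear_probe(z[train][:, [j]], y[train], n_classes)
        per_dim.append(probe_accuracy(w, z[test][:, [j]], y[test]))
    w_all = fit_linear_probe(z[train], y[train], n_classes)
    return np.array(per_dim), probe_accuracy(w_all, z[test], y[test])

# Toy latents (hypothetical): dim 0 holds gender, dims 1-3 hold a 3-bit speaker code.
rng = np.random.default_rng(1)
n_samples = 4000
gender = rng.integers(0, 2, size=n_samples)
speaker = rng.integers(0, 8, size=n_samples)
speaker_bits = (speaker[:, None] >> np.arange(3)) & 1

z = np.zeros((n_samples, 4))
z[:, 0] = gender
z[:, 1:] = speaker_bits
z += 0.1 * rng.normal(size=z.shape)

gender_per_dim, gender_all = probe_dimensions(z, gender, n_classes=2)
speaker_per_dim, speaker_all = probe_dimensions(z, speaker, n_classes=8)
print("gender  per-dim:", gender_per_dim, "all dims:", gender_all)
print("speaker per-dim:", speaker_per_dim, "all dims:", speaker_all)
```

On this toy data a single dimension is enough to decode gender, whereas speaker identity is only recoverable once the dimensions are probed jointly. That mirrors, in a deliberately simplified way, the contrast between single-dimension and aggregated probes described in the TL;DR.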

Paper Structure (25 sections, 5 equations, 6 figures, 6 tables)

Introduction
Methodology
SynSpeech Dataset
General Setup
Evaluation Metrics
Linear Probing
Supervised Disentanglement Evaluation
Intervention-based Metrics
Information-based Metrics
Predictor-based Metrics
Explicitness Score
Results and Analysis
Linear Probing of Latent Dimensions
Supervised Disentanglement Evaluation
Speaker Identification
...and 10 more sections

Figures (6)

Figure 1: Illustration of the Neural Speech Synthesizer process. The model takes three primary inputs: spoken text $S^{(i)}$, speaker identity $I^{(j)}$, and speaking style $E^{(l)}$. These inputs are combined through the synthesis function $r_\theta(S^{(i)}, I^{(j)}, E^{(l)})$ to generate the final utterance $U^{(t)}$, capturing the specified content, speaker characteristics, and style.
Figure 2: Illustration of supervised disentanglement learning notation. The figure represents the transformation from factor space $\mathcal{V}$, containing generative factors, to input space $\mathcal{X}$ via the generative process $g(\cdot)$. The learned representation function $r_\theta(\cdot)$ further maps input data from $\mathcal{X}$ to latent space $\mathcal{Z}$, where disentangled representations are formed. This setup facilitates the evaluation of disentanglement by comparing the generative factors in $\mathcal{V}$ with their corresponding representations in $\mathcal{Z}$.
Figure 3: Architecture of the multiband Beta-VAE model with spectral distance and adversarial fine-tuning. This setup includes the multi-band decomposition, which processes each frequency band independently to enhance spectral fidelity, as well as adversarial fine-tuning to improve realism.
Figure 4: LP Accuracy of Latent Dimensions for Speaker ID, Gender, and Speaking Style on the medium-sized dataset with reported mean and standard deviation for 5 experimental runs.
Figure 5: Comparison of original, reconstructed, and generated waveforms and spectrograms. The top row presents the waveform representations, illustrating the time-domain characteristics of the input, reconstructed, and generated audio signals. The bottom row shows the corresponding spectrograms, visualizing frequency content over time. The consistency between the original and reconstructed signals indicates effective preservation of temporal and spectral features, while the generated signal demonstrates the model’s ability to synthesize a plausible approximation of the original audio. Differences in the fine structure of the generated spectrogram suggest areas for improvement in capturing higher-frequency details and transient elements.
...and 1 more figure
