Learning Disentangled Audio Representations through Controlled Synthesis

Yusuf Brima; Ulf Krumnack; Simone Pika; Gunther Heidemann

TL;DR

The paper tackles the lack of benchmarks for disentangled audio representations by introducing SynTone, a synthetic dataset with explicit ground-truth factors controlling timbre $\mathbb{T}$, amplitude $\mathbb{A}$, and frequency $\mathbb{F}$. It benchmarks four VAE-based disentanglement methods (vanilla VAE, $\beta$-VAE, FactorVAE, and $\beta$-TCVAE) on mel-spectrogram inputs, using a suite of metrics (MIG, SAP, JEMMIG, DCIMIG, Modularity) to assess how well each method recovers the factors. The results reveal complementary strengths: the vanilla VAE achieves strong compactness (MIG, SAP), while FactorVAE excels on modularity (JEMMIG, Modularity). $\beta$-VAE and $\beta$-TCVAE show limitations on several metrics, underscoring that disentanglement behavior in audio is dataset-dependent. The SynTone benchmark provides a controlled testbed to guide future method development and the extension toward more diverse, real-world audio disentanglement tasks.
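To make the idea of explicit ground-truth factors concrete, the following is a minimal sketch of a controlled tone generator. It is not the SynTone generation code: the function name `synth_tone`, the three waveform names, the 16 kHz sample rate, and the one-second duration are all assumptions chosen for illustration.

```python
import numpy as np

def synth_tone(timbre, amplitude, frequency, duration=1.0, sample_rate=16000):
    """Hypothetical generator: one tone fully determined by three factors.

    timbre    -- waveform shape, one of "sine", "square", "sawtooth" (assumed set)
    amplitude -- peak amplitude of the waveform
    frequency -- fundamental frequency in Hz
    """
    t = np.arange(int(duration * sample_rate)) / sample_rate
    phase = 2.0 * np.pi * frequency * t
    if timbre == "sine":
        wave = np.sin(phase)
    elif timbre == "square":
        wave = np.sign(np.sin(phase))
    elif timbre == "sawtooth":
        # Ramp from -1 to 1 once per period.
        wave = 2.0 * ((frequency * t) % 1.0) - 1.0
    else:
        raise ValueError(f"unknown timbre: {timbre!r}")
    return amplitude * wave

# Each sample is paired with the factor values that produced it, so a
# disentanglement metric can later compare latents against known labels.
factors = {"timbre": "sine", "amplitude": 0.5, "frequency": 440.0}
audio = synth_tone(**factors)
```

Because every sample is produced from known factor values, the labels needed by supervised disentanglement metrics come for free, which is the property a synthetic benchmark relies on.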

Abstract

This paper tackles the scarcity of benchmarking data in disentangled auditory representation learning. We introduce SynTone, a synthetic dataset with explicit ground truth explanatory factors for evaluating disentanglement techniques. Benchmarking state-of-the-art methods on SynTone highlights its utility for method evaluation. Our results underscore strengths and limitations in audio disentanglement, motivating future research.
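As an illustration of how a benchmark with ground-truth factors is scored, here is a minimal sketch of the Mutual Information Gap (MIG), one of the metrics named in the TL;DR. It follows the commonly used definition: for each factor, the gap between the two highest mutual-information values across latent dimensions, normalized by the factor's entropy. The histogram binning, the `n_bins` value, and all function names are assumptions, not the paper's evaluation code.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (nats) of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def mutual_info(a, b):
    """Mutual information (nats) between two discrete 1-D arrays."""
    _, ai = np.unique(np.ravel(a), return_inverse=True)
    _, bi = np.unique(np.ravel(b), return_inverse=True)
    joint = np.zeros((ai.max() + 1, bi.max() + 1))
    np.add.at(joint, (ai, bi), 1.0)
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / outer[nz])).sum())

def mig(latents, factors, n_bins=20):
    """MIG estimate. latents: (n, d) continuous, d >= 2; factors: (n, k) discrete."""
    # Discretize each latent dimension with an equal-width histogram (assumed choice).
    binned = np.stack(
        [np.digitize(z, np.histogram(z, bins=n_bins)[1][:-1]) for z in latents.T],
        axis=1,
    )
    gaps = []
    for k in range(factors.shape[1]):
        mi = np.sort(
            [mutual_info(binned[:, j], factors[:, k]) for j in range(binned.shape[1])]
        )[::-1]
        gaps.append((mi[0] - mi[1]) / entropy(factors[:, k]))
    return float(np.mean(gaps))
```

A score near 1 means each factor is captured by a single latent dimension, while a score near 0 means at least two dimensions carry the same information about it.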

Paper Structure (10 sections, 12 figures, 1 table)

Introduction
Methodology
SynTone Dataset
Modeling
Experiments
Conclusion
Appendix
Model Input Preprocessing
Model Architecture
Reconstruction and Sampling

Figures (12)

Figure 1: Time-domain representation (left), Fourier transform (middle), and time-frequency (mel-spectrogram) representation (right) for original and reconstructed samples using the VAE model. All three plots indicate lower reconstruction quality for this model: the reconstruction adds noise frequencies around 7000 Hz and, as the mel-spectrogram shows, lower harmonic partials.
Figure 2: Visualization of generated audio samples using the VAE model. The left column depicts the time-domain representation, the middle column the Fourier transform, and the right column the time-frequency (mel-spectrogram) representation. The Fourier-domain periodogram shows that the model can isolate a single major frequency band close to 3000 Hz.
Figure 3: Time-domain representation (left), Fourier transform (middle), and time-frequency (mel-spectrogram) representation (right) for original and reconstructed samples using the $\beta$-VAE model. This model achieved good reconstruction quality relative to the vanilla VAE, as shown in the Fourier space, where the frequency factor of 4248.42 Hz is correctly isolated.
Figure 4: Visualization of generated audio samples using the $\beta$-VAE model. The left column depicts the time-domain representation, the middle column the Fourier transform, and the right column the time-frequency (mel-spectrogram) representation. The samples show minor harmonic partials close to 1000 Hz, while the major frequency component remains predominant in both the periodogram and the mel-spectrogram.
Figure 5: Time-domain representation (left), Fourier transform (middle), and time-frequency (mel-spectrogram) representation (right) for original and reconstructed samples using the FactorVAE model. This model achieved the best reconstruction quality, as shown in the Fourier space, where the frequency factor of 4248.42 Hz is correctly isolated.
...and 7 more figures
