Learning Disentangled Audio Representations through Controlled Synthesis
Yusuf Brima, Ulf Krumnack, Simone Pika, Gunther Heidemann
TL;DR
The paper addresses the lack of benchmarks for disentangled audio representations by introducing SynTone, a synthetic dataset whose explicit ground-truth factors control timbre $\mathbb{T}$, amplitude $\mathbb{A}$, and frequency $\mathbb{F}$. It benchmarks four VAE-based disentanglement methods (vanilla VAE, $\beta$-VAE, Factor-VAE, and $\beta$-TCVAE) on mel-spectrogram inputs, using a suite of metrics (MIG, SAP, JEMMIG, DCIMIG, Modularity) to assess how well each method recovers the factors. The results reveal complementary strengths: the vanilla VAE achieves strong compactness (MIG, SAP), while Factor-VAE excels on modularity (JEMMIG, Modularity). $\beta$-VAE and $\beta$-TCVAE show limitations on several metrics, which underscores that disentanglement behavior in audio is dataset-dependent. SynTone provides a controlled testbed to guide future method development and the extension toward more diverse, real-world audio disentanglement tasks.
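To make the compactness notion behind MIG concrete, the following is a minimal sketch of a Mutual Information Gap estimate. For each ground-truth factor it takes the gap between the two latent dimensions that share the most mutual information with that factor, normalizes by the factor's entropy, and averages over factors. The function names, the histogram discretization, and the bin count are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
import numpy as np


def discrete_entropy(x):
    """Shannon entropy (in nats) of a 1-D array of discrete values."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def discrete_mutual_info(x, y):
    """I(x; y) = H(x) + H(y) - H(x, y) for two discrete 1-D arrays."""
    joint = np.stack([x, y], axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    h_xy = float(-np.sum(p * np.log(p)))
    return discrete_entropy(x) + discrete_entropy(y) - h_xy


def mig(latents, factors, n_bins=20):
    """Mutual Information Gap estimate.

    latents: (N, D) array of continuous latent codes.
    factors: (N, K) array of discrete (integer) ground-truth factors.
    n_bins:  number of histogram bins per latent (an assumed default).
    """
    n_latents = latents.shape[1]
    # Discretize each latent dimension with equal-width bins.
    binned = np.empty(latents.shape, dtype=np.int64)
    for j in range(n_latents):
        edges = np.histogram_bin_edges(latents[:, j], bins=n_bins)
        binned[:, j] = np.digitize(latents[:, j], edges[1:-1])

    gaps = []
    for k in range(factors.shape[1]):
        mi = np.array([
            discrete_mutual_info(binned[:, j], factors[:, k])
            for j in range(n_latents)
        ])
        top = np.sort(mi)[::-1]  # mutual information, largest first
        gaps.append((top[0] - top[1]) / discrete_entropy(factors[:, k]))
    return float(np.mean(gaps))
```

Under this sketch, a code in which each factor is captured by exactly one latent dimension scores close to 1, whereas a code that spreads a factor evenly over two dimensions scores close to 0.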
Abstract
This paper tackles the scarcity of benchmarking data in disentangled auditory representation learning. We introduce SynTone, a synthetic dataset with explicit ground truth explanatory factors for evaluating disentanglement techniques. Benchmarking state-of-the-art methods on SynTone highlights its utility for method evaluation. Our results underscore strengths and limitations in audio disentanglement, motivating future research.
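As an illustration of what controlled synthesis with explicit ground-truth factors can look like, here is a minimal sketch that renders a single tone from a (timbre, amplitude, frequency) triple. The waveform families, the sample rate, the duration, and the name `synth_tone` are assumptions chosen for the example; the actual SynTone generation procedure may differ.

```python
import numpy as np


def synth_tone(timbre, amplitude, frequency, duration=1.0, sample_rate=16000):
    """Render one tone whose generative factors are known exactly.

    timbre:    one of "sine", "square", "sawtooth", "triangle" (assumed set).
    amplitude: peak amplitude of the waveform.
    frequency: fundamental frequency in Hz.
    """
    t = np.arange(int(duration * sample_rate)) / sample_rate
    cycle = (frequency * t) % 1.0  # position within the current period, in [0, 1)
    if timbre == "sine":
        wave = np.sin(2.0 * np.pi * frequency * t)
    elif timbre == "square":
        wave = np.where(cycle < 0.5, 1.0, -1.0)
    elif timbre == "sawtooth":
        wave = 2.0 * cycle - 1.0
    elif timbre == "triangle":
        wave = 2.0 * np.abs(2.0 * cycle - 1.0) - 1.0
    else:
        raise ValueError(f"unknown timbre: {timbre!r}")
    return (amplitude * wave).astype(np.float32)


# Example: every sample is stored together with its ground-truth factors.
factors = {"timbre": "sine", "amplitude": 0.5, "frequency": 440.0}
audio = synth_tone(**factors)
```

Because each waveform is produced directly from its factor values, the labels needed by supervised disentanglement metrics come for free, which is the property that makes a synthetic benchmark of this kind useful.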
