Evaluating Disentangled Representations for Controllable Music Generation
Laura Ibáñez-Martínez, Chukwuemeka Nkama, Andrea Poltronieri, Xavier Serra, Martín Rocamora
TL;DR
The paper tackles how to reliably evaluate disentangled structure–timbre representations for controllable music generation, addressing a gap where evaluation often focuses on outputs rather than learned embeddings. It adapts the synesis representation-probing framework to music, comparing SS-VQ-VAE, TS-DSAE, and AFTER under targeted disentanglement strategies and ablations, using Slakh2100 for training and SynTheory-based probes. The findings show high mutual information between structure and timbre embeddings, asymmetric leakage, and tempo information encoded in timbre representations, pointing to incomplete disentanglement and limitations for controllability. These results motivate rethinking controllability in music generation, highlighting the need to pair representation diagnostics with decoder-level and perceptual assessments and to refine strategies that encourage true factor separation.
Abstract
Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
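As a rough illustration of the invariance axis, the sketch below scores how much a set of embeddings changes when the input undergoes a controlled transformation. It is not the paper's metric: the abstract does not say how invariance is computed, so the use of mean cosine similarity, the function names `cosine_similarity` and `invariance_score`, and the toy vectors are all assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def invariance_score(original: Sequence[Vector],
                     transformed: Sequence[Vector]) -> float:
    """Mean cosine similarity between paired embeddings.

    original[i] and transformed[i] are assumed to be embeddings of the
    same clip before and after a controlled transformation (hypothetical
    setup). Values near 1.0 suggest the embedding ignores the transformation.
    """
    if not original or len(original) != len(transformed):
        raise ValueError("expected two non-empty, equally sized sequences")
    sims = [cosine_similarity(a, b) for a, b in zip(original, transformed)]
    return sum(sims) / len(sims)


# Toy vectors standing in for embeddings of two clips (made-up numbers).
before = [[1.0, 0.0], [1.0, 0.0]]
after = [[2.0, 0.0], [0.0, 3.0]]  # first pair aligned, second orthogonal
score = invariance_score(before, after)
```

Under this proxy, a timbre embedding that is meant to ignore a structural change such as a tempo shift would be expected to score close to 1.0, while a lower score would hint that the transformed attribute leaks into the embedding.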
