DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation
Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji
TL;DR
DisMix tackles pitch-timbre disentanglement in multi-instrument mixtures by representing each instrument $i$ with a source-level pitch-timbre pair $(\nu^{(i)},\tau^{(i)})$ and conditioning a decoder on the set of source-level representations $\mathcal{S}=\{s^{(i)}\}$. It presents two instantiations: a simple auto-encoder that validates the core disentanglement on isolated chords, and a latent diffusion model with a Diffusion Transformer (DiT) that handles realistic four-part chorales in the style of J.S. Bach. Both use a binarisation layer for pitch and a Gaussian prior for timbre, and both are optimised via an ELBO with auxiliary losses. The approach enables compositional manipulation, such as swapping instruments or altering melodies, which is demonstrated through qualitative and quantitative evaluations including disentanglement metrics and Fréchet Audio Distance. The latent diffusion variant achieves strong pitch-timbre disentanglement and high-quality audio with set-based conditioning, indicating practical potential for controllable multi-instrument synthesis without heavy reliance on external source separation. Overall, DisMix advances modular, object-centric representations for music synthesis, enabling attribute-level mixing and transformation of complex ensembles.
Abstract
Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present. To fill the gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of these building blocks forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model on both a simple dataset of isolated chords and a realistic dataset of four-part chorales in the style of J.S. Bach, identify the key components for successful disentanglement, and demonstrate mixture transformation based on source-level attribute manipulation.
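To make the idea of source-level attribute manipulation concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each source representation is formed by concatenating a binarised pitch vector with a real-valued timbre vector, and it uses random arrays as stand-ins for encoder outputs. The dimensions and the helper names `combine` and `swap_timbre` are hypothetical. Swapping the timbre latents of two sources while keeping their pitch latents fixed yields a new set that a decoder could render as a mixture in which the two instruments exchange melodies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: four sources (e.g. four chorale parts), 8-dim latents.
N_SOURCES, PITCH_DIM, TIMBRE_DIM = 4, 8, 8


def combine(pitch, timbre):
    """Form source-level representations from (pitch, timbre) pairs.

    Concatenation is an assumption made for this sketch.
    """
    return np.concatenate([pitch, timbre], axis=-1)


def swap_timbre(pitch, timbre, i, j):
    """Return the source set after exchanging the timbre latents of sources i and j."""
    new_timbre = timbre.copy()
    new_timbre[[i, j]] = timbre[[j, i]]
    return combine(pitch, new_timbre)


# Stand-ins for encoder outputs (random here, learned in the actual model).
pitch = rng.integers(0, 2, size=(N_SOURCES, PITCH_DIM)).astype(float)  # binarised pitch
timbre = rng.standard_normal((N_SOURCES, TIMBRE_DIM))                  # Gaussian-like timbre

sources = combine(pitch, timbre)            # original set, shape (4, 16)
swapped = swap_timbre(pitch, timbre, 0, 1)  # sources 0 and 1 exchange instruments
```

In this sketch the manipulated set `swapped` would then be passed to the mixture decoder in place of `sources`; the untouched sources keep their original representations.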
