DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation

Yin-Jyun Luo; Kin Wai Cheuk; Woosung Choi; Toshimitsu Uesaka; Keisuke Toyama; Koichi Saito; Chieh-Hsin Lai; Yuhta Takida; Wei-Hsiang Liao; Simon Dixon; Yuki Mitsufuji

TL;DR

DisMix tackles pitch-timbre disentanglement in multi-instrument mixtures by representing each instrument with a source-level pair $(\nu^{(i)},\tau^{(i)})$ and conditioning a decoder on a set of source-level representations $\mathcal{S}=\{s^{(i)}\}$. It presents two instantiations: a simple auto-encoder that validates the core disentanglement, and a latent diffusion model with a Diffusion Transformer (DiT) that handles realistic Bach chorales. Both incorporate a binarisation layer for pitch and a Gaussian prior for timbre, and are optimized via an ELBO with auxiliary losses. The approach enables compositional manipulation, such as swapping instruments or altering melodies, demonstrated through qualitative and quantitative evaluations including disentanglement metrics and Fréchet Audio Distance. The latent diffusion variant achieves strong pitch-timbre disentanglement and high-quality audio with set-based conditioning, indicating practical potential for controllable multi-instrument synthesis without heavy reliance on external source separation. Overall, DisMix advances modular, object-centric representations for music synthesis, enabling flexible, attribute-level mixing and transformation of complex ensembles.

Abstract

Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present. To fill this gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and the collection of these representations forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model using both a simple dataset of isolated chords and a realistic dataset of four-part chorales in the style of J.S. Bach, identify the key components for the success of disentanglement, and demonstrate the application of mixture transformation based on source-level attribute manipulation.
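To make the idea of source-level attribute manipulation concrete, the following is a minimal sketch of swapping the timbre latents of two sources while leaving their pitch latents untouched. It is not the authors' implementation: the latent sizes, the random placeholder latents, the `combine` function (plain concatenation standing in for the mapping from $(\nu^{(i)},\tau^{(i)})$ to $s^{(i)}$), and the `swap_timbre` helper are all assumptions made for illustration. In the actual framework the latents would come from trained encoders, and the edited set would condition a decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_SOURCES, D_PITCH, D_TIMBRE = 4, 8, 8

# Placeholder latents standing in for encoder outputs (assumption):
# binary pitch codes and Gaussian-distributed timbre vectors.
pitch = rng.integers(0, 2, size=(N_SOURCES, D_PITCH)).astype(float)
timbre = rng.normal(size=(N_SOURCES, D_TIMBRE))


def combine(pitch, timbre):
    """Hypothetical stand-in for (nu, tau) -> s: concatenate per source."""
    return np.concatenate([pitch, timbre], axis=-1)


def swap_timbre(timbre, i, j):
    """Return a copy of the timbre latents with sources i and j exchanged."""
    swapped = timbre.copy()
    swapped[[i, j]] = swapped[[j, i]]
    return swapped


# Original set of source-level latents, one row per instrument.
source_set = combine(pitch, timbre)

# Edited set: sources 0 and 1 keep their melodies but trade instruments.
edited_set = combine(pitch, swap_timbre(timbre, 0, 1))
```

Under these assumptions, `edited_set` differs from `source_set` only in the timbre part of the first two rows, which mirrors the instrument-swapping manipulation described above.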

Paper Structure (56 sections, 11 equations, 5 figures, 7 tables)

Introduction
Related Work
Pitch and Timbre Disentanglement
Object-Centric Representation Learning
DisMix: The Proposed Framework
Mixture and Query Encoders
Pitch and Timbre Encoders
Constraining Pitch Latents
Constraining Timbre Latents
Training Objectives
ELBO
Pitch Supervision
Barlow Twins
The Final Objective
A Simple Case Study
...and 41 more sections

Figures (5)

Figure 1: A mixture $x_m$ of $N_s$ instruments is represented by a set of source-level latents $\{ s^{(i)} \}_{i=1}^{N_s}$ integrating latents of pitch $\nu^{(i)}$ and timbre $\tau^{(i)}$. Diamond nodes denote deterministic mappings.
Figure 2: Left: PCA of the timbre space. Top: DisMix, plot $\tau^{(i)}$. Mid and bottom: Remove $\mathcal{L}_{\mathrm{BT}}$, plot the mean of $q_{\phi_\tau}(\tau^{(i)})$ and the sampling, respectively. Right: Novel mixture rendering. Refer to Section \ref{sec:mixture-gen} for details.
Figure 3: Left: PCA of the timbre space. Right: Compositional mixture rendering is achieved by modifying the members of a set of source-level latents.
Figure 4: Replacing instruments of a reference mixture (the top row) given a target mixture (the bottom row).
Figure 5: A regular post-norm Transformer block.
