Stemphonic: All-at-once Flexible Multi-stem Music Generation

Shih-Lun Wu, Ge Zhu, Juan-Pablo Caceres, Cheng-Zhi Anna Huang, Nicholas J. Bryan

TL;DR

Stemphonic targets a trade-off in stem-level music generation: existing systems either emit a fixed set of stems in parallel or generate one stem at a time. It couples latent diffusion/flow modeling with stem grouping and per-group noise sharing to produce a variable set of musically synchronized stems in one pass. It also supports conditional multi-stem generation and stem-wise activity controls, so stems can be generated from scratch or from existing context, with control over when each stem is active. The approach builds on a Diffusion Transformer (DiT) trained on stem latents with a rectified flow objective and a probability-flow ODE sampler, and reports improved mix quality alongside 25–50% faster full-mix generation on MoisesDB and MusDB. The framework gives composers a fast, controllable way to assemble and iterate on stem-based mixes from text prompts and existing material.

Abstract

Music stem generation, the task of producing musically synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further extend our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls, empowering users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%. Demos at: https://stemphonic-demo.vercel.app.
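
The training-time idea in the abstract (each stem is a batch element, synchronized stems form a group, and each group shares one noise latent) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the latent shape, and the interpolation convention x_t = (1 - t) * x_data + t * noise with velocity target noise - x_data are assumptions chosen as one common rectified-flow formulation. Whether the timestep is also shared within a group is not stated above, so the sketch simply samples it per stem as a placeholder.

```python
import numpy as np

def shared_group_noise(group_ids, latent_shape, rng):
    """Draw one Gaussian noise latent per group and give it to every stem in that group.

    group_ids: 1-D sequence with one entry per batch element (stem);
               stems cut from the same song segment carry the same id.
    """
    group_ids = np.asarray(group_ids).ravel()
    unique_groups, inverse = np.unique(group_ids, return_inverse=True)
    per_group = rng.standard_normal((len(unique_groups), *latent_shape))
    return per_group[inverse]  # shape: (batch, *latent_shape)

def rectified_flow_pair(x_data, noise, t):
    """Assumed convention: straight-line path from data (t=0) to noise (t=1)."""
    t = np.asarray(t, dtype=float).reshape(-1, 1, 1)
    x_t = (1.0 - t) * x_data + t * noise
    v_target = noise - x_data  # regression target for the velocity model
    return x_t, v_target

# Hypothetical batch: three stems from song segment 0, two from segment 1.
rng = np.random.default_rng(0)
group_ids = [0, 0, 0, 1, 1]
latents = rng.standard_normal((5, 4, 8))   # (batch, channels, frames), made-up sizes
noise = shared_group_noise(group_ids, (4, 8), rng)
t = rng.uniform(size=5)                    # per-stem timestep (assumption)
x_t, v_target = rectified_flow_pair(latents, noise, t)
```

In this sketch the only change relative to ordinary per-sample training is the noise draw: stems in the same group see an identical noise latent, which is what lets a shared initial noise at inference act as a synchronization signal.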

Paper Structure (14 sections, 2 equations, 1 figure, 3 tables)

Figures (1)

  • Figure 1: Our Stemphonic framework for flexible multi-stem music generation. (Top) At training, each group of synchronized stems shares the same noise latent. (Bottom) At inference, we use a shared initial noise to generate variable multi-stem outputs in one pass. We also enable conditional stem generation and stem-wise activity controls.
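
The inference path in the bottom half of Figure 1 can be sketched in the same spirit: one initial noise latent is copied to every requested stem, each copy receives its own conditioning, and all copies are integrated together in a single batched ODE solve. Everything below is illustrative. `toy_velocity` is a hypothetical stand-in for the trained model, the per-stem `conds` arrays stand in for text conditioning, and the plain Euler integrator and the t = 1 (noise) to t = 0 (data) direction are assumptions rather than the paper's sampler settings.

```python
import numpy as np

def shared_initial_noise(num_stems, latent_shape, rng):
    """Draw a single noise latent and copy it to every stem in the batch."""
    shared = rng.standard_normal(latent_shape)
    return np.broadcast_to(shared, (num_stems, *latent_shape)).copy()

def toy_velocity(x_t, t, conds):
    """Hypothetical model stand-in.

    Under the assumed path x_t = (1 - t) * x0 + t * noise, a model that steers
    stem k exactly toward conds[k] has velocity (x_t - conds) / t.
    """
    return (x_t - conds) / t

def generate_stems(velocity_fn, conds, latent_shape, n_steps, rng):
    """Euler-integrate all stems together from t=1 (noise) down to t=0 (data)."""
    x = shared_initial_noise(len(conds), latent_shape, rng)
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        # One batched model call per step covers every requested stem.
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur, conds)
    return x

# Hypothetical request for three stems; sizes are made up for illustration.
rng = np.random.default_rng(1)
conds = rng.standard_normal((3, 4, 8))
stems = generate_stems(toy_velocity, conds, (4, 8), n_steps=10, rng=rng)
```

Because the number of stems is just the batch dimension, asking for more or fewer stems changes only `len(conds)`, not the model or the number of solver steps.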