MusicGen-Stem: Multi-stem music generation and edition through autoregressive modeling

Simon Rouard, Robin San Roman, Yossi Adi, Axel Roebel

TL;DR

MusicGen-Stem addresses the need for flexible, stem-level control in music generation. It introduces per-stem tokenization with specialized compressors and a multi-stream autoregressive transformer that generates bass, drums, and other stems in parallel. The approach supports text- and audio-conditioned generation as well as stem editing and stem-by-stem iterative composition, enabling targeted modifications without full regeneration. Evaluations on internal instrumental data separated with Demucs show text-conditioned generation that is competitive with prior models and stem-editing performance that is superior to baselines. The model is currently limited to three stems due to data constraints; future work focuses on improving bass token quality and on richer conditioning for the remaining stems.

Abstract

While most music generation models generate a mixture of stems (in mono or stereo), we propose to train a multi-stem generative model with 3 stems (bass, drums and other) that learns the musical dependencies between them. To do so, we train one specialized compression algorithm per stem to tokenize the music into parallel streams of tokens. Then, we leverage recent improvements in the task of music source separation to train a multi-stream text-to-music language model on a large dataset. Finally, thanks to a particular conditioning method, our model is able to edit bass, drums or other stems on existing or generated songs as well as perform iterative composition (e.g. generating bass on top of existing drums). This gives more flexibility in music generation algorithms, and it is, to the best of our knowledge, the first open-source multi-stem autoregressive music generation model that can perform good quality generation and coherent source editing. Code and model weights will be released and samples are available at https://simonrouard.github.io/musicgenstem/.

Paper Structure (15 sections, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Three use cases of our model: (top) MusicGen-Stem can perform text-to-music generation, producing parallel streams of tokens that represent the 3 stems (bass, drums and other). (bottom) MusicGen-Stem can also perform stem editing: given a subgroup of stems, the model can generate the complementary ones with an optional text prompt. (right) Given the waveform of one or multiple stems (which can be extracted from an existing song with Demucs), we tokenize them and MusicGen-Stem generates the missing stems with an optional text prompt. We can then decode them.
  • Figure 2: Training pipeline. Given a song paired with its textual description, we separate the song into stems with the source separation model Demucs and tokenize each stem with its own compression model. There is one stream of tokens for the bass, one for the drums, and 4 streams of tokens for the other instruments. These tokens, together with the encoded textual description, are then fed into MusicGen-Stem's autoregressive transformer, which is trained with a cross-entropy loss.
  • Figure 3: Training the editing task. Here the drums and the last 2 streams of the other stem are masked. The cross-entropy loss is computed on the tokens to the right of the masked tokens.
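
The stream layout in Figure 2 and the masking pattern in Figure 3 can be illustrated with a small sketch. This is not the authors' code: the function names, the `MASK` value, the toy token ids, and the row ordering (bass, drums, other) are all assumptions made for illustration. It shows only which streams are masked, and does not model how the loss is placed on the tokens to the right of the masked ones.

```python
# Illustrative sketch only; names and values below are hypothetical.
STREAMS = {"bass": 1, "drums": 1, "other": 4}  # token streams per stem (Figure 2)
MASK = -1  # placeholder id standing in for a masked token (assumption)


def stream_index():
    """Map each stem name to the row indices of its token streams."""
    index, row = {}, 0
    for stem, count in STREAMS.items():
        index[stem] = list(range(row, row + count))
        row += count
    return index


def mask_streams(tokens, rows):
    """Return a copy of `tokens` (one list per stream) with `rows` fully masked."""
    rows = set(rows)
    return [
        [MASK] * len(stream) if i in rows else list(stream)
        for i, stream in enumerate(tokens)
    ]


# Toy token grid: 6 streams x 4 time steps, with made-up token ids.
tokens = [[s * 10 + t for t in range(4)] for s in range(6)]

# Masking pattern described in Figure 3: drums plus the last 2 streams of "other".
idx = stream_index()
rows = idx["drums"] + idx["other"][-2:]
masked = mask_streams(tokens, rows)
```

In this toy layout the six rows are ordered bass, drums, then the four other streams, so the Figure 3 pattern leaves the bass and the first two other streams visible as conditioning.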