Subtractive Training for Music Stem Insertion using Latent Diffusion Models

Ivan Villa-Renteria, Mason L. Wang, Zachary Shah, Zhe Li, Soohyun Kim, Neelesh Ramachandran, Mert Pilanci

TL;DR

The paper tackles the problem of generating missing musical stems that coherently integrate with existing context by reframing stem insertion as spectrogram editing guided by text instructions. It introduces Subtractive Training, which uses triplets of (full-mix, stem-subtracted, edit instruction) and fine-tunes a pre-trained text-to-audio latent diffusion model to learn $p(\boldsymbol{x} \mid \boldsymbol{y}, \boldsymbol{x}_{\text{partial}})$, enabling context-aware insertion of stems. Empirically, the method yields realistic drum accompaniments and enables stem-wise style control, with the MIDI extension producing compatible bass, drum, and guitar parts. This approach offers a scalable, text-guided framework for editing full musical arrangements at the stem level, supporting creative rearrangements while preserving other instruments.
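
For readers who want the conditional distribution above in objective form, one common way to fit it with a latent diffusion model is the noise-prediction loss sketched below. This is an illustration under assumptions, not the paper's stated loss: the encoder $\mathcal{E}$, the text encoder $\tau$, the noise schedule $\bar{\alpha}_t$, and the way the stem-subtracted input reaches the denoiser $\boldsymbol{\epsilon}_\theta$ are all symbols introduced here. The symbols $\boldsymbol{x}$ (full mix), $\boldsymbol{y}$ (edit instruction), and $\boldsymbol{x}_{\text{partial}}$ (stem-subtracted mix) follow the summary.

```latex
\begin{align}
  \boldsymbol{z}_0 &= \mathcal{E}(\boldsymbol{x}), \qquad
  \boldsymbol{z}_t = \sqrt{\bar{\alpha}_t}\,\boldsymbol{z}_0
      + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},
  \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \\
  \mathcal{L}(\theta) &= \mathbb{E}_{\boldsymbol{x},\,\boldsymbol{y},\,\boldsymbol{x}_{\text{partial}},\,\boldsymbol{\epsilon},\,t}
  \left[ \left\lVert \boldsymbol{\epsilon}
      - \boldsymbol{\epsilon}_\theta\!\left(\boldsymbol{z}_t,\, t,\, \tau(\boldsymbol{y}),\,
        \mathcal{E}(\boldsymbol{x}_{\text{partial}})\right) \right\rVert_2^2 \right]
\end{align}
```

Under this reading, the denoiser sees the noisy latent of the full mix together with the instruction and the stem-subtracted context, so sampling from it yields a full mix in which only the missing stem has to be invented.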

Abstract

We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusion model to generate the missing instrument stem, guided by both the existing stems and the text instruction. Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We also show that we can use the text instruction to control the generation of the inserted stem in terms of rhythm, dynamics, and genre, allowing us to modify the style of a single instrument in a full song while keeping the remaining instruments the same. Lastly, we extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.
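
The data pairing described above can be sketched in a few lines. The example below is a minimal illustration, assuming each stem is available as a time-aligned waveform array of equal length. The names `Triplet` and `make_triplet` are hypothetical, and the fixed instruction template stands in for the LLM-generated instructions used in the paper. Conversion to spectrograms and the fine-tuning step itself are omitted.

```python
from dataclasses import dataclass
from typing import Mapping

import numpy as np


@dataclass
class Triplet:
    """One training example: (full mix, stem-subtracted mix, edit instruction)."""
    full_mix: np.ndarray         # x: all stems summed (the generation target)
    stem_subtracted: np.ndarray  # x_partial: the mix without the target stem
    instruction: str             # y: text describing how to add the stem back


def make_triplet(stems: Mapping[str, np.ndarray], target: str, style: str) -> Triplet:
    """Build a triplet by removing `target` from the mix of `stems`.

    Assumption: stems are time-aligned arrays of identical shape, so the
    full mix is their elementwise sum.
    """
    if target not in stems:
        raise KeyError(f"no stem named {target!r}")
    full_mix = np.stack(list(stems.values())).sum(axis=0)
    stem_subtracted = full_mix - stems[target]
    # Hypothetical template; the paper uses LLM-generated instructions instead.
    instruction = f"Add {style} {target}."
    return Triplet(full_mix, stem_subtracted, instruction)


# Toy three-sample "waveforms", purely for illustration.
stems = {
    "drums": np.array([1.0, 0.0, -1.0]),
    "bass": np.array([0.5, 0.5, 0.5]),
    "guitar": np.array([0.0, 0.25, 0.0]),
}
example = make_triplet(stems, target="drums", style="rock-style")
print(example.instruction)
```

At training time the model would receive `stem_subtracted` and `instruction` as conditioning and be asked to reproduce `full_mix`. At inference time the same interface lets a user swap in a different instruction to change the style of the inserted stem.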

Paper Structure (23 sections, 5 figures, 1 table)


Figures (5)

  • Figure 1: Latent Diffusion Model for Drum-Insertion.
  • Figure 2: Comparison of spectrograms before and after stem addition. The stem-subtracted spectrogram is an input to our model, while the generated spectrogram is the output from the edit instruction "Add rock-style drums."
  • Figure 3: Comparison of spectrograms before and after stem addition. The original genre of the song is reggae. Drums were inserted with the edit instruction "add jazzy drums."
  • Figure 4: Subjective comparison of our method (RiffInpaint) to the SDEdit baseline. Win percentages and counts are shown for each method/criterion.
  • Figure 5: Pitch rolls showing stem-generation results using two different diffusion models, each trained to output a given instrument. Notes corresponding to the generated instrument are outlined in black. Guitar generation examples can be found on the website.