STAGE: Stemmed Accompaniment Generation through Prefix-Based Conditioning

Giorgio Strano, Chiara Ballanti, Donato Crisostomi, Michele Mancusi, Luca Cosmo, Emanuele Rodolà

TL;DR

STAGE targets iterative music composition: it generates a single-stem accompaniment conditioned on an existing mix or a metronome track, addressing the mismatch between from-scratch generation and real-world workflows. It fine-tunes MusicGen with a prefix-based context token so that the model learns a token-to-token mapping between context and accompaniment, producing coherent, rhythmically aligned outputs without extra encoders. The approach shows strong beat-following and coherence (measured with COCOLA, FAD, and KAD) and supports tempo-conditioned generation through audio inputs such as a metronome track, while remaining parameter-efficient. The result is a practical tool for musicians that fits into production workflows and extends to other instruments with lightweight fine-tuning and open-source resources.

Abstract

Recent advances in generative models have made it possible to create high-quality, coherent music, with some systems delivering production-level output. Yet most existing models focus solely on generating music from scratch, limiting their usefulness for musicians who want to integrate such models into a human, iterative composition workflow. In this paper we introduce STAGE, our STemmed Accompaniment GEneration model, fine-tuned from the state-of-the-art MusicGen to generate single-stem instrumental accompaniments conditioned on a given mixture. Inspired by instruction-tuning methods for language models, we extend the transformer's embedding matrix with a context token, enabling the model to attend to a musical context through prefix-based conditioning. Compared to the baselines, STAGE yields accompaniments that exhibit stronger coherence with the input mixture, higher audio quality, and closer alignment with textual prompts. Moreover, by conditioning on a metronome-like track, our framework naturally supports tempo-constrained generation, achieving state-of-the-art alignment with the target rhythmic structure, all without requiring any additional tempo-specific module. As a result, STAGE offers a practical, versatile tool for interactive music creation that can be readily adopted by musicians in real-world workflows.
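The prefix-based conditioning described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single flat token stream (MusicGen actually uses several parallel codebooks), a toy vocabulary, and a randomly initialised new embedding row. The function names `add_context_token` and `build_prefixed_sequence` are hypothetical.

```python
import numpy as np


def add_context_token(embedding: np.ndarray, rng: np.random.Generator):
    """Append one row to the embedding matrix for the new context token.

    Returns the extended matrix and the id of the new token, which is
    simply the first index past the original table (an assumption made
    for this sketch).
    """
    context_id = embedding.shape[0]
    new_row = rng.normal(0.0, 0.02, size=(1, embedding.shape[1]))
    return np.concatenate([embedding, new_row], axis=0), context_id


def build_prefixed_sequence(context_tokens: np.ndarray,
                            target_tokens: np.ndarray,
                            context_id: int) -> np.ndarray:
    """Lay out the input as [context tokens, context token, target tokens]."""
    separator = np.array([context_id], dtype=np.int64)
    return np.concatenate([context_tokens, separator, target_tokens])


rng = np.random.default_rng(0)
toy_embedding = rng.normal(size=(8, 4))           # toy vocabulary of 8 tokens
extended, ctx_id = add_context_token(toy_embedding, rng)

context = np.array([1, 2, 3], dtype=np.int64)     # tokens of the input mixture
target = np.array([4, 5], dtype=np.int64)         # tokens of the stem to generate
sequence = build_prefixed_sequence(context, target, ctx_id)
inputs = extended[sequence]                       # embeddings fed to the transformer
```

In this layout the transformer sees the context as an ordinary prefix, so conditioning needs no separate encoder; the only new parameters in the sketch are the single appended embedding row.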

Paper Structure

This paper contains 22 sections, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Outline of our proposed model. (top) STAGE takes a musical context as input and generates a single-stem accompaniment. (bottom) STAGE takes a metronome-like track and generates a stem that follows the desired rhythmic structure.
  • Figure 2: Illustration of the delay pattern used by MusicGen, and how the context token is placed to separate the audio context from the input sequence of the transformer.
  • Figure 3: Comparison of rhythmic alignment when passing only a mixture as conditioning vs. the combination of the same mixture with a metronome track. For STAGE-drums, the F1 alignment improves from $52.6$ to $64.0$, and for STAGE-bass from $40.9$ to $46.8$.
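To make the F1 alignment numbers in Figure 3 concrete, the following is a minimal sketch of a beat-level F1 score. The exact evaluation protocol is not given in this section, so the details here are assumptions: beats are timestamps in seconds, an estimated beat counts as a hit if it falls within a tolerance window of a reference beat (70 ms is a common choice in beat-tracking evaluation), and matching is greedy and one-to-one. The function name `beat_f1` is hypothetical.

```python
from typing import Sequence


def beat_f1(reference: Sequence[float],
            estimated: Sequence[float],
            tolerance: float = 0.07) -> float:
    """F1 between reference and estimated beat times (in seconds).

    Each reference beat can be matched by at most one estimated beat,
    and only if the two lie within `tolerance` seconds of each other.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    if not ref or not est:
        return 0.0

    i = j = hits = 0
    while i < len(ref) and j < len(est):
        diff = est[j] - ref[i]
        if abs(diff) <= tolerance:
            hits += 1      # matched pair: advance both pointers
            i += 1
            j += 1
        elif diff < 0:
            j += 1         # estimated beat is too early to match anything left
        else:
            i += 1         # reference beat was missed

    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Metronome at 120 BPM versus beats detected in a generated stem.
metronome_beats = [0.0, 0.5, 1.0, 1.5]
detected_beats = [0.02, 0.52, 1.2, 1.49]
score = beat_f1(metronome_beats, detected_beats)
```

Here three of the four detected beats fall within the window, so precision and recall are both 0.75 and the score is 0.75; the figure reports such scores as percentages.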