MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline

Fang-Duo Tsai, Yi-An Lai, Fei-Yueh Chen, Hsueh-Wei Fu, Li Chai, Wei-Jaw Lee, Hao-Chung Cheng, Yi-Hsuan Yang

TL;DR

Unlike conventional singing accompaniment generation (SAG) settings that assume continuously sung vocals, compositional song generation involves intermittent vocals. This work handles them by combining explicit rhythmic/harmonic controls with audio continuation, keeping the backing track consistent across vocal and non-vocal regions.

Abstract

Song generation aims to produce full songs with vocals and accompaniment from lyrics and text descriptions, yet end-to-end models remain data- and compute-intensive and provide limited editability. We advocate a compositional alternative that decomposes the task into melody composition, singing voice synthesis, and singing accompaniment generation. Central to our approach is MIDI-informed singing accompaniment generation (MIDI-SAG), which conditions accompaniment on the symbolic vocal-melody MIDI to improve rhythmic and harmonic alignment between singing and instrumentation. Moreover, beyond conventional SAG settings that assume continuously sung vocals, compositional song generation features intermittent vocals; we address this by combining explicit rhythmic/harmonic controls with audio continuation to keep the backing track consistent across vocal and non-vocal regions. With lightweight newly trained components requiring only 2.5k hours of audio on a single RTX 3090, our pipeline approaches the perceptual quality of recent open-source end-to-end baselines in several metrics. We provide audio demos and will open-source our model at https://composerflow.github.io/web/.
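
The decomposition described in the abstract can be pictured as a chain of independent stages. The following is a minimal Python sketch of that idea only, not the authors' implementation: `SongState`, `compose_melody`, `synthesize_singing`, `generate_accompaniment`, `run_pipeline`, and `TOY_SAMPLE_RATE` are hypothetical names, and each stage is a placeholder that returns dummy data instead of calling a real model.

```python
from dataclasses import dataclass, replace
from typing import Callable, Sequence, Tuple

# All names below are hypothetical; the stages are stubs, not real models.
Note = Tuple[float, float, int]  # (onset_sec, offset_sec, midi_pitch)
TOY_SAMPLE_RATE = 100  # samples per second, deliberately tiny for the sketch


@dataclass(frozen=True)
class SongState:
    lyrics: str
    description: str
    melody: Tuple[Note, ...] = ()
    vocal_audio: Tuple[float, ...] = ()
    accompaniment: Tuple[float, ...] = ()


Stage = Callable[[SongState], SongState]


def compose_melody(state: SongState) -> SongState:
    # Stand-in for melody composition: one 0.5 s note per lyric word.
    words = state.lyrics.split()
    notes = tuple(
        (0.5 * i, 0.5 * (i + 1), 60 + (i % 5)) for i in range(len(words))
    )
    return replace(state, melody=notes)


def synthesize_singing(state: SongState) -> SongState:
    # Stand-in for singing voice synthesis: silence spanning the melody.
    duration = state.melody[-1][1] if state.melody else 0.0
    n_samples = int(round(duration * TOY_SAMPLE_RATE))
    return replace(state, vocal_audio=(0.0,) * n_samples)


def generate_accompaniment(state: SongState) -> SongState:
    # Stand-in for MIDI-informed accompaniment generation: it requires the
    # symbolic melody in addition to the vocal audio, and returns a track
    # of the same length as the vocals.
    if not state.melody:
        raise ValueError("accompaniment stage needs the vocal-melody MIDI")
    return replace(state, accompaniment=(0.0,) * len(state.vocal_audio))


def run_pipeline(state: SongState, stages: Sequence[Stage]) -> SongState:
    # Apply each stage in order; every stage reads and extends the state.
    for stage in stages:
        state = stage(state)
    return state


PIPELINE = (compose_melody, synthesize_singing, generate_accompaniment)

song = run_pipeline(SongState("la la la la", "pop ballad"), PIPELINE)

# Editability: change the melody, then rerun only the downstream stages.
edited = replace(song, melody=song.melody[:2])
resung = run_pipeline(edited, PIPELINE[1:])
```

Because every stage reads from and writes to an explicit intermediate representation, an edit to the melody only requires rerunning the later stages, which is the kind of editability the compositional design is meant to offer.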

Paper Structure (29 sections, 2 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Overview of the compositional song generation pipeline. The system sequentially maps lyrics to a full song through: (1) Melody Composition (CSL-L2M), (2) SVS (FastSpeech-based), (3) Melody Harmonization (AccoMontage2), and (4) the proposed MIDI-SAG, which adapts MuseControlLite to incorporate symbolic and acoustic conditioning for final accompaniment synthesis.
  • Figure 2: Architectural comparison of SAG variants: (a) conventional audio-SAG; (b) the proposed MIDI-SAG with the ground-truth vocal MIDI score; (c) the MIDI-SAG variant that uses an automatically extracted MIDI representation.
  • Figure 3: Comparison of rhythmic stability. White stripes mark the beat positions predicted from the generated accompaniment. While (a) audio-SAG loses rhythmic consistency in non-vocal segments, (b) MIDI-SAG maintains a stable beat and coherent content.
  • Figure 4: The data-preprocessing pipeline used to curate data for fine-tuning Stable Audio Open to implement our MIDI-informed singing accompaniment generation (MIDI-SAG) model.
  • Figure 5: Augmenting Stable Audio Open for singing accompaniment and audio continuation. The architecture utilizes the MuseControlLite framework to integrate multi-modal conditioning signals, enabling precise singing-accompaniment alignment and seamless long-form audio continuation.
  • ...and 2 more figures