MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline

Fang-Duo Tsai; Yi-An Lai; Fei-Yueh Chen; Hsueh-Wei Fu; Li Chai; Wei-Jaw Lee; Hao-Chung Cheng; Yi-Hsuan Yang

TL;DR

Conventional singing accompaniment generation (SAG) assumes continuously sung vocals, whereas compositional song generation features intermittent vocals. This work addresses the gap by combining explicit rhythmic/harmonic controls with audio continuation, keeping the backing track consistent across vocal and non-vocal regions.

Abstract

Song generation aims to produce full songs with vocals and accompaniment from lyrics and text descriptions, yet end-to-end models remain data- and compute-intensive and provide limited editability. We advocate a compositional alternative that decomposes the task into melody composition, singing voice synthesis, and singing accompaniment generation. Central to our approach is MIDI-informed singing accompaniment generation (MIDI-SAG), which conditions accompaniment on the symbolic vocal-melody MIDI to improve rhythmic and harmonic alignment between singing and instrumentation. Moreover, beyond conventional SAG settings that assume continuously sung vocals, compositional song generation features intermittent vocals; we address this by combining explicit rhythmic/harmonic controls with audio continuation to keep the backing track consistent across vocal and non-vocal regions. With lightweight newly trained components requiring only 2.5k hours of audio on a single RTX 3090, our pipeline approaches the perceptual quality of recent open-source end-to-end baselines in several metrics. We provide audio demos and will open-source our model at https://composerflow.github.io/web/.
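
To make the intermittent-vocal setting concrete, the following is a minimal sketch, not the authors' implementation, of how non-vocal regions could be located from a symbolic vocal melody. The `Note` schema, the `non_vocal_regions` helper, and the 1-second `min_gap` threshold are all assumptions introduced here for illustration; the paper's actual conditioning signals may be derived differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Note:
    """One vocal-melody note; times in seconds (hypothetical schema)."""
    onset: float
    offset: float
    pitch: int  # MIDI pitch number


def non_vocal_regions(
    notes: List[Note], total_dur: float, min_gap: float = 1.0
) -> List[Tuple[float, float]]:
    """Return (start, end) spans of at least min_gap seconds with no sung note."""
    regions = []
    cursor = 0.0  # end time of the latest note seen so far
    for note in sorted(notes, key=lambda n: n.onset):
        if note.onset - cursor >= min_gap:
            regions.append((cursor, note.onset))
        cursor = max(cursor, note.offset)
    # Trailing silence after the last note (e.g. an outro).
    if total_dur - cursor >= min_gap:
        regions.append((cursor, total_dur))
    return regions


# Toy melody: a 4 s intro, a short phrase, an interlude, one note, an outro.
melody = [Note(4.0, 4.5, 60), Note(4.5, 5.0, 62), Note(9.0, 10.0, 64)]
gaps = non_vocal_regions(melody, total_dur=12.0)
print(gaps)  # [(0.0, 4.0), (5.0, 9.0), (10.0, 12.0)]
```

In this toy case most of the clip contains no singing, which illustrates why an accompaniment model conditioned only on vocal audio has little to follow in those spans.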

Paper Structure (29 sections, 2 equations, 7 figures, 10 tables)

Introduction
Related Work
Compositional Song Generation Framework
Melody Composition (T1) & Singing Synthesis (T2)
MIDI-informed Accompaniment Generation (T3)
Proposed Techniques for Structural Completeness
MIDI-SAG without Ground-truth Vocal MIDI
System Implementation
Inference Process of the Proposed Pipeline
Experimental Setup
Experimental Results
Experiment 1: Short-form (10s) SAG
Experiment 2: Long-form (47s) SAG
Experiment 3: Long-Form Song Generation
Ablation Study on Conditioning Signals
...and 14 more sections

Figures (7)

Figure 1: Overview of the compositional song generation pipeline. The system sequentially maps lyrics to a full song through: (1) Melody Composition (CSL-L2M), (2) SVS (FastSpeech-based), (3) Melody Harmonization (AccoMontage2), and (4) the proposed MIDI-SAG, which adapts MuseControlLite to incorporate symbolic and acoustic conditioning for final accompaniment synthesis.
Figure 2: Architectural comparison of SAG variants: (a) conventional audio-SAG; (b) the proposed MIDI-SAG with a ground-truth vocal MIDI score; (c) the MIDI-SAG variant that uses an automatically extracted MIDI representation.
Figure 3: Comparison of rhythmic stability. White stripes mark the beat positions predicted from the generated accompaniment. While (a) audio-SAG loses rhythmic consistency in non-vocal segments, (b) MIDI-SAG maintains a stable beat and coherent content.
Figure 4: The data-preprocessing pipeline used to curate data for fine-tuning Stable Audio Open to implement our MIDI-informed singing accompaniment generation (MIDI-SAG) model.
Figure 5: Augmenting Stable Audio Open for singing accompaniment and audio continuation. The architecture utilizes the MuseControlLite framework to integrate multi-modal conditioning signals, enabling precise singing-accompaniment alignment and seamless long-form audio continuation.
...and 2 more figures
