MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Yun-Han Lan, Wen-Yi Hsiao, Hao-Chung Cheng, Yi-Hsuan Yang

TL;DR

MusiConGen adds precise temporal control to text-to-music generation by conditioning a Transformer-based model, built on the pretrained MusicGen, on chords and rhythm. The method uses two complementary chord representations and a beat/downbeat rhythm signal, and it combines jump finetuning with adaptive in-attention so that the model can be adapted on consumer-grade GPUs. Evaluations on MUSDB18 and RWC-pop-100 show improved alignment with the specified rhythmic and harmonic conditions, and subjective tests confirm stronger chord controllability. The implementation is open source and accepts conditions either as features extracted from reference audio or as symbolic inputs, which makes user-guided backing-track generation practical.

Abstract

Existing text-to-music models can produce high-quality audio with great diversity. However, textual prompts alone cannot precisely control temporal musical features such as the chords and rhythm of the generated music. To address this challenge, we introduce MusiConGen, a temporally-conditioned Transformer-based text-to-music model that builds upon the pretrained MusicGen framework. Our innovation lies in an efficient finetuning mechanism, tailored for consumer-grade GPUs, that integrates automatically-extracted rhythm and chords as the condition signal. During inference, the condition can be either musical features extracted from a reference audio signal or a user-defined symbolic chord sequence, BPM, and textual prompt. Our performance evaluation on two datasets (one derived from extracted features and the other from user-created inputs) demonstrates that MusiConGen can generate realistic backing track music that aligns well with the specified conditions. We open-source the code and model checkpoints, and provide audio examples online at https://musicongen.github.io/musicongen_demo/.
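As a rough illustration of the user-defined input path described in the abstract, the sketch below expands a BPM value and a symbolic chord sequence into frame-wise signals. It is not MusiConGen's preprocessing code: the function names `beat_frames` and `chord_frames`, the 50 frames-per-second rate, the fixed four beats per bar, and the `(label, length_in_beats)` chord format are all assumptions made for this example.

```python
from typing import List, Tuple

FRAME_RATE = 50  # frames per second; an assumed value for illustration only


def beat_frames(bpm: float, duration_s: float, beats_per_bar: int = 4,
                frame_rate: int = FRAME_RATE) -> Tuple[List[int], List[int]]:
    """Return two 0/1 frame-wise lists marking beat and downbeat positions."""
    n_frames = int(round(duration_s * frame_rate))
    beat = [0] * n_frames
    downbeat = [0] * n_frames
    seconds_per_beat = 60.0 / bpm
    k = 0  # index of the current beat
    while True:
        frame = int(round(k * seconds_per_beat * frame_rate))
        if frame >= n_frames:
            break
        beat[frame] = 1
        if k % beats_per_bar == 0:  # first beat of each bar is a downbeat
            downbeat[frame] = 1
        k += 1
    return beat, downbeat


def chord_frames(chords: List[Tuple[str, int]], bpm: float,
                 frame_rate: int = FRAME_RATE) -> List[str]:
    """Expand (chord_label, length_in_beats) pairs into one label per frame."""
    seconds_per_beat = 60.0 / bpm
    labels: List[str] = []
    elapsed_beats = 0
    for label, n_beats in chords:
        start = int(round(elapsed_beats * seconds_per_beat * frame_rate))
        elapsed_beats += n_beats
        end = int(round(elapsed_beats * seconds_per_beat * frame_rate))
        labels.extend([label] * (end - start))
    return labels


# Hypothetical usage: two bars at 120 BPM, one chord per bar.
beat, downbeat = beat_frames(bpm=120, duration_s=4.0)
labels = chord_frames([("C", 4), ("G", 4)], bpm=120)
```

In a real system the chord labels would still need to be encoded as vectors before being passed to the model; that step is omitted here because the summary does not describe it.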

Paper Structure (20 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: The model structure of MusiConGen and the self-attention block. (a) MusiConGen takes text $\mathcal{T}$ and the downsampled chord $\mathcal{C}_{pre}$ as prepended conditions, and the frame-wise chord $\mathcal{C}_{sum}$ and rhythm $\mathcal{R}$ as additive conditions. The addition of the frame-wise conditions to each self-attention block is regulated by the condition gate control ($\otimes$). (b) Each self-attention block consists of four layers. In the proposed model, only the first layer of each block is finetuned, a scheme called jump finetuning.
  • Figure 2: Comparison of chord progressions and beats between ground truth and generated samples, using the conditions from RWC. For each example, (a) or (b), the top row shows the ground-truth chords and the bottom row shows the chords extracted from the generated sample. The thick and light gray lines mark the downbeat and beat times, respectively.
  • Figure 3: Subjective evaluation of condition controls: mean opinion scores on a 5-point scale with 95% confidence intervals.
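The two mechanisms named in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `jump_finetune_mask` and `gated_add` are hypothetical helpers, the layer count in the usage line is illustrative, hidden states are reduced to flat lists of floats, and the gate is treated as a plain scalar because the caption does not say how its value is set.

```python
from typing import List


def jump_finetune_mask(n_layers: int, block_size: int = 4) -> List[bool]:
    """Mark which layers are trained: only the first layer of each block.

    Layers are grouped into consecutive blocks of `block_size`; every other
    layer in a block would stay frozen at its pretrained weights.
    """
    return [i % block_size == 0 for i in range(n_layers)]


def gated_add(hidden: List[float], condition: List[float],
              gate: float) -> List[float]:
    """Add a frame-wise condition to hidden states, scaled by a gate value.

    gate = 0.0 leaves the hidden states unchanged; gate = 1.0 adds the
    condition in full.
    """
    if len(hidden) != len(condition):
        raise ValueError("hidden and condition must have the same length")
    return [h + gate * c for h, c in zip(hidden, condition)]


# Hypothetical usage: 8 layers form two blocks of four.
mask = jump_finetune_mask(8)
# mask == [True, False, False, False, True, False, False, False]
```

With a mask like this, a training loop would enable gradients only for the layers marked `True`, which is what keeps the number of trainable parameters small.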