MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss

Yangyang Shu, Haiming Xu, Ziqin Zhou, Anton van den Hengel, Lingqiao Liu

TL;DR

This work addresses the lack of fine-grained, bar-level controllability in symbolic music generation by extending a foundation model with bar-specific prompts. MuseBarControl introduces three components: Control Prompt Augmentation, which encodes per-bar attributes; auxiliary-task pre-training (PA), which aligns prompts with music tokens; and a counterfactual loss (CF), which enforces responsiveness to control prompts. The method culminates in the final objective $L = L_{BFT} + \lambda L_{CF}$. Empirically, it improves bar-level chord accuracy by $13.06\%$ over the MuseCoco baseline while preserving musicality, as shown on the POP909 dataset through bar-level chord and global attribute evaluations and supported by human judgments. The approach enables bar-level edits and style mimicry, could be extended to bar-level attributes beyond chords, and suggests a path toward user-friendly, text-based bar-level control in future work.
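To make the reported quantities concrete, the following is a minimal sketch of how a bar-level chord accuracy and the weighted objective $L = L_{BFT} + \lambda L_{CF}$ might be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names `bar_chord_accuracy` and `combined_objective` are hypothetical, chords are assumed to be available as one label per bar (how they are extracted from generated music is not specified here), and the two loss terms are treated as already-computed scalars because their definitions are not given in this summary.

```python
from typing import Sequence


def bar_chord_accuracy(prompted: Sequence[str], generated: Sequence[str]) -> float:
    """Fraction of bars whose generated chord label equals the prompted label.

    Assumption: both sequences hold exactly one chord label per bar.
    """
    if len(prompted) != len(generated):
        raise ValueError("prompted and generated must cover the same number of bars")
    if not prompted:
        raise ValueError("at least one bar is required")
    matches = sum(p == g for p, g in zip(prompted, generated))
    return matches / len(prompted)


def combined_objective(l_bft: float, l_cf: float, lam: float) -> float:
    """Weighted sum L = L_BFT + lambda * L_CF of two precomputed scalar losses."""
    return l_bft + lam * l_cf


# Hypothetical four-bar example: the last bar does not follow its prompt.
prompt_chords = ["C", "G", "Am", "F"]
output_chords = ["C", "G", "Am", "G"]
accuracy = bar_chord_accuracy(prompt_chords, output_chords)  # 3 of 4 bars match

# Hypothetical loss values, chosen only to show how lambda scales the CF term.
total_loss = combined_objective(l_bft=2.0, l_cf=0.5, lam=0.5)
```

In this toy case three of the four bars match their prompts, so the accuracy is 0.75; the weight $\lambda$ only controls how strongly the counterfactual term contributes relative to the fine-tuning term.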

Abstract

Automatically generating symbolic music (music scores tailored to specific human needs) can be highly beneficial for musicians and enthusiasts. Recent studies have shown promising results using extensive datasets and advanced transformer architectures. However, these state-of-the-art models generally offer only basic control over aspects like tempo and style for the entire composition, lacking the ability to manage finer details, such as control at the level of individual bars. While fine-tuning a pre-trained symbolic music generation model might seem like a straightforward method for achieving this finer control, our research indicates challenges in this approach. The model often fails to respond adequately to new, fine-grained bar-level control signals. To address this, we propose two innovative solutions. First, we introduce a pre-training task designed to link control signals directly with corresponding musical tokens, which helps in achieving a more effective initialization for subsequent fine-tuning. Second, we implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts. Together, these techniques significantly enhance our ability to control music generation at the bar level, showing a 13.06% improvement over conventional methods. Our subjective evaluations also confirm that this enhanced control does not compromise the musical quality of the original pre-trained generative model.

Paper Structure (34 sections, 6 equations, 7 figures, 5 tables)

This paper contains 34 sections, 6 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: An example of three pop songs sharing the same chord progression. The top row displays five-column chords, while the bottom three rows represent the pop songs "Canon" by Pachelbel, "Far Away" by Jay Chou, and "Absolute Obsession" by Sam Lee.
  • Figure 2: Control prompt augmentation.
  • Figure 4: The vote percentages of music generated by MuseCoco and our method, as judged by 16 piano teachers for similarity to human creation.
  • Figure 6: Two generated examples using "Canon chord progression" with different global attribute controls.
  • Figure 7: Some good cases of generated piano rolls, where the chords in the generated music perfectly align with the prompts.
  • ...and 2 more figures