MuseBarControl: Enhancing Fine-Grained Control in Symbolic Music Generation through Pre-Training and Counterfactual Loss
Yangyang Shu, Haiming Xu, Ziqin Zhou, Anton van den Hengel, Lingqiao Liu
TL;DR
This work addresses the lack of fine-grained, bar-level controllability in symbolic music generation by extending a foundation model with bar-specific prompts. MuseBarControl introduces three components: Control Prompt Augmentation, which encodes per-bar attributes; pre-training on an auxiliary task (PA), which aligns prompts with music tokens; and a counterfactual loss (CF), which enforces responsiveness to control prompts. The final objective is $L = L_{BFT} + \lambda L_{CF}$. On the POP909 dataset, the method improves bar-level chord accuracy by $13.06\%$ over the MuseCoco baseline while preserving musicality, as shown by bar-level chord and global attribute evaluations and supported by human judgments. The approach enables bar-level edits and style mimicry, could extend to bar-level attributes beyond chords, and suggests a path toward user-friendly, text-based bar-level control in future work.
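To make the combined objective $L = L_{BFT} + \lambda L_{CF}$ concrete, the following is a minimal Python sketch, not the authors' implementation. The summary does not give the exact form of either term, so the sketch assumes that $L_{BFT}$ is an average negative log-likelihood and that $L_{CF}$ is a hinge penalty comparing the likelihood of the same target tokens under the true bar prompt and under a swapped (counterfactual) prompt. The function names, the `margin` parameter, and the value of `lam` are all hypothetical.

```python
import math

def nll(token_log_probs):
    """Average negative log-likelihood of the target tokens (assumed form of L_BFT)."""
    return -sum(token_log_probs) / len(token_log_probs)

def counterfactual_loss(logp_true_prompt, logp_cf_prompt, margin=1.0):
    """Hinge penalty (an assumed form of L_CF, not taken from the paper).

    The same target tokens are scored twice: once conditioned on the true
    bar-level prompt and once on a counterfactual prompt. The loss is zero
    only when the true prompt makes the tokens more likely by at least `margin`.
    """
    gap = nll(logp_cf_prompt) - nll(logp_true_prompt)
    return max(0.0, margin - gap)

def total_loss(l_bft, l_cf, lam=0.5):
    """Final objective L = L_BFT + lambda * L_CF; `lam` is a hypothetical weight."""
    return l_bft + lam * l_cf

# Toy per-token log-probabilities for one bar (made-up numbers).
logp_true = [math.log(0.5), math.log(0.5)]    # conditioned on the true chord prompt
logp_cf = [math.log(0.25), math.log(0.25)]    # conditioned on a swapped chord prompt

l_bft = nll(logp_true)
l_cf = counterfactual_loss(logp_true, logp_cf)
loss = total_loss(l_bft, l_cf)
```

Under this assumed form, a model that ignores the control prompt assigns identical likelihoods in both cases and pays the full margin, which is the behavior a counterfactual term is meant to discourage.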
Abstract
Automatically generating symbolic music (music scores tailored to specific human needs) can be highly beneficial for musicians and enthusiasts. Recent studies have shown promising results using extensive datasets and advanced transformer architectures. However, these state-of-the-art models generally offer only basic control over aspects like tempo and style for the entire composition, lacking the ability to manage finer details, such as control at the level of individual bars. While fine-tuning a pre-trained symbolic music generation model might seem like a straightforward method for achieving this finer control, our research indicates challenges in this approach. The model often fails to respond adequately to new, fine-grained bar-level control signals. To address this, we propose two innovative solutions. First, we introduce a pre-training task designed to link control signals directly with corresponding musical tokens, which helps in achieving a more effective initialization for subsequent fine-tuning. Second, we implement a novel counterfactual loss that promotes better alignment between the generated music and the control prompts. Together, these techniques significantly enhance our ability to control music generation at the bar level, showing a 13.06% improvement over conventional methods. Our subjective evaluations also confirm that this enhanced control does not compromise the musical quality of the original pre-trained generative model.
