Aligning Generative Music AI with Human Preferences: Methods and Challenges
Dorien Herremans, Abhinaba Roy
TL;DR
This work addresses the gap between likelihood-based training and human musical preferences by surveying three main alignment approaches: training-time reinforcement learning from human feedback (RLHF), training-time Direct Preference Optimization (DPO), and inference-time alignment. It highlights two concrete instantiations, MusicRL for large-scale preference learning and DiffRhythm+ for diffusion-based multi-preference control, alongside Text2midi-InferAlign for inference-time optimization, reporting substantial gains in human-judged quality and coherence while noting remaining challenges. The authors argue for a multi-objective, culturally aware framework with scalable data, unified inference-time tooling, and cross-domain evaluation to enable practical applications in interactive composition and personalized music services. Realizing this potential requires interdisciplinary collaboration across machine learning, music theory, cognitive science, and ethics to build systems that truly serve human creativity and experience.
Abstract
Recent advances in generative AI for music have achieved remarkable fidelity and stylistic diversity, yet these systems often fail to align with nuanced human preferences because the loss functions they are trained on do not directly capture those preferences. This paper advocates for the systematic application of preference alignment techniques to music generation, addressing the fundamental gap between computational optimization and human musical appreciation. Drawing on recent breakthroughs, including large-scale preference learning in MusicRL, multi-preference alignment frameworks such as the diffusion-based preference optimization in DiffRhythm+, and inference-time optimization techniques such as Text2midi-InferAlign, we discuss how these techniques can address music's unique challenges: temporal coherence, harmonic consistency, and subjective quality assessment. We identify key research challenges, including scalability to long-form compositions and the reliability of preference modelling, among others. Looking forward, we envision preference-aligned music generation enabling transformative applications in interactive composition tools and personalized music services. This work calls for sustained interdisciplinary research combining advances in machine learning and music theory to create music AI systems that truly serve human creative and experiential needs.
