MusicRL: Aligning Music Generation to Human Preferences

Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil Zeghidour, Andrea Agostinelli

TL;DR

MusicRL demonstrates how reinforcement learning from human feedback can align open-ended music generation with human preferences by combining automatic rewards for text adherence and audio quality with large-scale user preferences. Starting from MusicLM, the authors train multiple RL-finetuned variants, with MusicRL-RU (sequentially incorporating MuLan, quality, and user preferences) delivering the strongest performance both quantitatively and in human evaluations. The approach reveals that preferences extend beyond strict text adherence and quality, underscoring subjectivity in musical appreciation and the value of scalable user feedback. This work lays groundwork for user-centric fine-tuning of audio generative models, highlighting practical benefits and future directions for personalization and on-policy data collection.

Abstract

We propose MusicRL, the first music generation system finetuned from human feedback. Appreciation of text-to-music models is particularly subjective since the concept of musicality as well as the specific intention behind a caption are user-dependent (e.g. a caption such as "upbeat work-out music" can map to a retro guitar solo or a techno pop beat). Not only does this make supervised training of such models challenging, but it also calls for integrating continuous human feedback in their post-deployment finetuning. MusicRL is a pretrained autoregressive MusicLM (Agostinelli et al., 2023) model of discrete audio tokens finetuned with reinforcement learning to maximise sequence-level rewards. We design reward functions related specifically to text adherence and audio quality with help from selected raters, and use those to finetune MusicLM into MusicRL-R. We deploy MusicLM to users and collect a substantial dataset comprising 300,000 pairwise preferences. Using Reinforcement Learning from Human Feedback (RLHF), we train MusicRL-U, the first text-to-music model that incorporates human feedback at scale. Human evaluations show that both MusicRL-R and MusicRL-U are preferred to the baseline. Ultimately, MusicRL-RU combines the two approaches and results in the best model according to human raters. Ablation studies shed light on the musical attributes influencing human preferences, indicating that text adherence and quality account for only part of them. This underscores the prevalence of subjectivity in musical appreciation and calls for further involvement of human listeners in the finetuning of music generation models.
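
To make the idea of learning a reward from pairwise preferences concrete, here is a minimal sketch. It assumes a Bradley-Terry style objective, which is a common choice for preference-based reward models; the abstract does not state which loss the authors use, and the function name and scalar scores below are hypothetical.

```python
import math


def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred clip wins.

    Assumption: a Bradley-Terry model, where
    P(preferred wins) = sigmoid(score_preferred - score_rejected).
    The scores stand in for the outputs of a hypothetical reward model.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); both branches
    # compute that quantity while keeping the exponent non-positive,
    # which avoids overflow for large margins.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# One preference pair: the user gave the trophy to the first clip.
print(preference_loss(score_preferred=1.3, score_rejected=0.2))
```

Under this assumption, the loss shrinks as the reward model scores the preferred clip further above the rejected one, and equals log 2 when the two scores are tied.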

Paper Structure (23 sections, 2 equations, 8 figures, 1 table)

Figures (8)

  • Figure 1: Results of the qualitative side-by-side evaluation for the RLHF finetuned models. In each X vs. Y comparison, the green bar corresponds to the percentage of times model X was preferred, the yellow bar to the percentage of ties and the red bar to the percentage of times model Y was preferred. MusicRL-R is the MusicLM model finetuned on quality and text adherence reward. MusicRL-U is finetuned on a reward model of user preferences. MusicRL-RU is finetuned sequentially on quality and adherence to text and then on a reward model of user preferences. Every RLHF finetuned version of MusicLM significantly outperforms MusicLM; MusicRL-R and MusicRL-U achieve comparable performance, and MusicRL-RU is overall the preferred model.
  • Figure 2: Given a dataset of music captions, MusicLM generates audio samples that are scored with a reward function. The RL algorithm finetunes the model to maximise the received reward.
  • Figure 3: The AI Test Kitchen MusicLM interface. The user can write a prompt or choose from suggestions. Each prompt generates two 20s clips, and the user can label their favorite clip among the two with a trophy.
  • Figure 4: Quality (left) or MuLan score (right) vs KL divergence for the RL-finetuned models. The KL divergence is computed between the RL-finetuned models and MusicLM except for MusicRL-RU where the KL divergence is computed against MusicRL-R. The black cross corresponds to the checkpoint used to start the training of MusicRL-RU. RL-finetuning successfully optimises the quality and the MuLan scores (MusicRL-R). Additionally, optimizing the user preference reward (MusicRL-U, MusicRL-RU) improves the quality score while marginally decreasing the MuLan score.
  • Figure 5: User Preference Reward Model Score for the different RL-finetuned models. The KL divergence is computed between the RL-finetuned models and MusicLM except for MusicRL-RU where the KL divergence is computed against MusicRL-R. The black cross corresponds to the checkpoint used to start the training of MusicRL-RU. RL-finetuning successfully improves the user preference reward model score of the generations (see MusicRL-U and MusicRL-RU curves). When trained on other rewards (MuLan and/or quality) the user preference reward model score slightly improves.
  • ...and 3 more figures
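
The win, tie and loss percentages described for Figure 1 can be tallied as in the sketch below. This is only an illustration of the bookkeeping: the verdict labels ("X", "tie", "Y"), the function name and the example ratings are assumptions, not the authors' evaluation code or data.

```python
from collections import Counter

LABELS = ("X", "tie", "Y")


def side_by_side_summary(verdicts):
    """Return the percentage of X wins, ties, and Y wins.

    `verdicts` holds one label per rating of an X vs. Y comparison;
    the label set is a hypothetical encoding chosen for this sketch.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    counts = Counter(verdicts)
    unknown = set(counts) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(verdicts)
    # Counter returns 0 for labels that never occur, so every bar is present.
    return {label: 100.0 * counts[label] / total for label in LABELS}


# Made-up ratings, purely for illustration.
print(side_by_side_summary(["X", "X", "tie", "Y"]))
```

The three values correspond to the green, yellow and red bars of one X vs. Y comparison and sum to 100.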