MR-FlowDPO: Multi-Reward Direct Preference Optimization for Flow-Matching Text-to-Music Generation

Alon Ziv, Sanyuan Chen, Andros Tjandra, Yossi Adi, Wei-Ning Hsu, Bowen Shi

TL;DR

MR-FlowDPO aligns text-to-music generation with human preferences by fine-tuning flow-matching models with Direct Preference Optimization (DPO) across multiple rewards. It introduces MRSD, a pairing scheme that ensures positive samples dominate across text alignment, production quality, and semantic coherence, and it uses reward prompting to integrate these signals during inference. Three rewards together drive richer musicality and rhythmic stability: a HuBERT-based semantic consistency reward learned on music data, a CLAP-based text alignment reward, and an aesthetic production quality predictor. Both objective metrics and human evaluations show substantial improvements over strong baselines, highlighting the practical potential of multi-reward preference alignment for music generation.

Abstract

A key challenge in music generation models is their lack of direct alignment with human preferences, as music evaluation is inherently subjective and varies widely across individuals. We introduce MR-FlowDPO, a novel approach that enhances flow-matching-based music generation models, a major class of modern music generative models, using Direct Preference Optimization (DPO) with multiple musical rewards. The rewards are crafted to assess music quality across three key dimensions: text alignment, audio production quality, and semantic consistency, utilizing scalable off-the-shelf models for each reward prediction. We employ these rewards in two ways: (i) by constructing preference data for DPO and (ii) by integrating the rewards into text prompting. To address the ambiguity in musicality evaluation, we propose a novel scoring mechanism leveraging semantic self-supervised representations, which significantly improves the rhythmic stability of generated music. We conduct an extensive evaluation using a variety of music-specific objective metrics as well as a human study. Results show that MR-FlowDPO significantly enhances overall music generation quality and is consistently preferred over highly competitive baselines in terms of audio quality, text alignment, and musicality. Our code is publicly available at https://github.com/lonzi/mrflow_dpo. Samples are provided on our demo page at https://lonzi.github.io/mr_flowdpo_demopage/.
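
The preference-data construction in (i) can be illustrated with a minimal sketch. It assumes that "dominate" means the positive sample scores strictly higher than the negative one on all three rewards; the exact rule used by the authors is not given here. The names `ScoredSample`, `dominates`, `build_preference_pairs`, and the `margin` parameter are hypothetical and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class ScoredSample:
    """One generated clip with its three reward scores (hypothetical container)."""
    sample_id: str
    text_alignment: float
    production_quality: float
    semantic_consistency: float

    def rewards(self):
        return (self.text_alignment, self.production_quality, self.semantic_consistency)


def dominates(a, b, margin=0.0):
    """True if `a` beats `b` by more than `margin` on every reward.

    Strict domination on all axes is an assumption of this sketch.
    """
    return all(ra > rb + margin for ra, rb in zip(a.rewards(), b.rewards()))


def build_preference_pairs(samples, margin=0.0):
    """Return (chosen_id, rejected_id) pairs among samples generated for one prompt."""
    return [
        (win.sample_id, lose.sample_id)
        for win, lose in permutations(samples, 2)
        if dominates(win, lose, margin)
    ]


# Illustrative scores only; these are not results from the paper.
samples = [
    ScoredSample("a", 0.90, 0.80, 0.70),
    ScoredSample("b", 0.50, 0.40, 0.30),
    ScoredSample("c", 0.95, 0.10, 0.50),
]
pairs = build_preference_pairs(samples)
print(pairs)
```

In this toy input only `a` is paired against `b`: sample `c` has the highest text alignment but loses on production quality, so it neither dominates nor is dominated by the others. Requiring agreement on every axis discards such ambiguous pairs rather than forcing a trade-off between rewards.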

Paper Structure

This paper contains 39 sections, 4 equations, 3 figures, 11 tables, 1 algorithm.

Figures (3)

  • Figure 1: An overview of MR-FlowDPO. $k$ music samples are generated using the reference model to obtain various automatic reward scores (e.g., production quality), which are later used for preference optimization of MR-FlowDPO.
  • Figure 2: Human study. Win rate of MR-FlowDPO-1B against MelodyFlow-1B, evaluated on four axes: overall preference (Overall), audio quality (Quality), musicality (Musicality), and text alignment (Text).
  • Figure 3: Human Annotation User Interface