Aligning Text-to-Music Evaluation with Human Preferences

Yichen Huang, Zachary Novack, Koichi Saito, Jiatong Shi, Shinji Watanabe, Yuki Mitsufuji, John Thickstun, Chris Donahue

TL;DR

This work tackles the problem of evaluating text-to-music (TTM) outputs by analyzing a broad design space of reference-based divergences and proposing a robust metric, MAD, built on self-supervised audio embeddings. Using a synthetic meta-evaluation across four musical desiderata and a large, open dataset of human preferences (MusicPrefs), the authors demonstrate that MAD correlates better with human judgments than the traditional Fréchet Audio Distance (FAD) and various baselines. They also release MusicPrefs and provide evidence that MAD generalizes beyond synthetic degradations to real human preferences, offering a practical automatic evaluation tool for open-weight TTM systems. The work advances reproducible, human-aligned evaluation for TTM and offers actionable insights for developing and benchmarking future models.

Abstract

Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, the first open-source dataset of human preferences for TTM systems. We find not only that the standard FAD setup is inconsistent on both synthetic and human preference data, but also that nearly all existing metrics fail to effectively capture desiderata and are only weakly correlated with human perception. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model. We find that this metric effectively captures diverse musical desiderata (average rank correlation 0.84 for MAD vs. 0.49 for FAD) and also correlates more strongly with MusicPrefs (0.62 vs. 0.14).
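The rank correlations quoted above compare the ordering a metric induces with a target ordering. As a rough illustration only, the sketch below computes Kendall's tau between a human preference score and a divergence. The choice of Kendall's tau is an assumption, suggested by the $\tau$ in the Figure 5 caption. The function name `kendall_tau`, the five systems, and all numbers are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical values for five TTM systems (illustrative, not from the paper).
human_score = [0.71, 0.64, 0.52, 0.40, 0.33]  # higher = preferred by listeners
divergence = [0.12, 0.30, 0.25, 0.48, 0.60]   # lower = closer to the reference set

# A divergence is "lower is better", so negate it before comparing orderings.
tau = kendall_tau(human_score, [-d for d in divergence])
print(f"Kendall tau between metric and human ranking: {tau:.2f}")
```

A value near 1 means the metric ranks systems in nearly the same order as listeners do, while a value near 0 means the two orderings are essentially unrelated.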

Paper Structure

This paper contains 27 sections, 1 equation, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Overview of our proposed automatic evaluation metric (MAD) and open dataset of human preferences for TTM (MusicPrefs). Given a collection of open TTM models, we present a thorough analysis of different reference-based divergence metrics and embedding backbones. Then, by collecting MusicPrefs, the first open-source dataset of TTM human preference data, we measure how well the rankings induced by different divergence metrics correlate with human preferences.
  • Figure 2: Aggregated results for synthetic meta-evaluation. Each plot shows the adjusted metric scores against ground truth levels of distortion, where the desired behavior is for the scores to monotonically decrease for each aspect. The metric scores are then normalized to [0, 1] and averaged across embedding models. Shaded areas show standard deviations.
  • Figure 3: Each row shows the proportion of instances where one system is preferred over each other individual system according to the pooled musicality and fidelity judgments. * indicates statistical significance with $P < 0.05$ under the Wilcoxon signed rank test. Note that win rates for opposite sides do not sum to one because we allow ties.
  • Figure 4: Synthetic meta-evaluation results with oracle reference sets aggregated across embedding models. Error bars show the standard deviations.
  • Figure 5: Average $\tau$ when varying the size of the evaluated music set.
  • ...and 2 more figures