MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation

Di Zhu, Zixuan Li

Abstract

Distributional metrics such as Fréchet Audio Distance cannot score individual music clips and correlate poorly with human judgments, while the only per-sample learned metric achieving high human correlation is closed-source. We introduce MuQ-Eval, an open-source per-sample quality metric for AI-generated music built by training lightweight prediction heads on frozen MuQ-310M features using MusicEval, a dataset of generated clips from 31 text-to-music systems with expert quality ratings. Our simplest model, which combines frozen features with attention pooling and a two-layer MLP, achieves system-level SRCC = 0.957 and utterance-level SRCC = 0.838 with human mean opinion scores. A systematic ablation over training objectives and adaptation strategies shows that no addition meaningfully improves the frozen baseline, indicating that frozen MuQ representations already capture quality-relevant information. Encoder choice is the dominant design factor, outweighing all architectural and training decisions. LoRA-adapted models trained on as few as 150 clips already achieve usable correlation, enabling personalized quality evaluators from individual listener annotations. A controlled degradation analysis reveals selective sensitivity to signal-level artifacts but insensitivity to musical-structural distortions. MuQ-Eval is fully open-source, outperforms existing open per-sample metrics, and runs in real time on a single consumer GPU. Code, model weights, and evaluation scripts are available at https://github.com/dgtql/MuQ-Eval.
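The abstract describes the simplest model as frozen encoder features followed by attention pooling and a two-layer MLP. The sketch below illustrates that general pattern in dependency-free Python; it is not the authors' implementation. The single learned query vector, the ReLU activation, the layer sizes, and all function names (`attention_pool`, `mlp_head`) are assumptions made for illustration, and the random vectors merely stand in for frame-level MuQ features. A practical version would use a tensor library.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse T frame vectors into one clip vector.

    Each frame is scored by its dot product with a (hypothetical) learned
    query vector; the softmax of those scores weights the frame average.
    """
    scores = [sum(q * x for q, x in zip(query, frame)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


def linear(x, weight, bias):
    """Dense layer: weight holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def mlp_head(pooled, w1, b1, w2, b2):
    """Two-layer MLP with ReLU (assumed) that outputs one quality score."""
    hidden = [max(0.0, h) for h in linear(pooled, w1, b1)]
    return linear(hidden, w2, b2)[0]


# Toy stand-ins: T frames of D-dimensional "frozen" features, hidden size H.
rng = random.Random(0)
T, D, H = 5, 4, 3
frames = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(T)]
query = [rng.gauss(0, 1) for _ in range(D)]
w1 = [[rng.gauss(0, 0.5) for _ in range(D)] for _ in range(H)]
b1 = [0.0] * H
w2 = [[rng.gauss(0, 0.5) for _ in range(H)]]
b2 = [3.0]

pooled, weights = attention_pool(frames, query)
score = mlp_head(pooled, w1, b1, w2, b2)
print(f"attention weights: {[round(w, 3) for w in weights]}")
print(f"predicted quality score: {score:.3f}")
```

Only the query vector and the MLP parameters would be trained in such a setup; the encoder producing `frames` stays fixed, which is what keeps the head lightweight.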

Paper Structure (56 sections, 8 equations, 6 figures, 7 tables)


Figures (6)

  • Figure 1: Training dynamics: (a) validation SRCC vs. epoch and (b) training loss vs. epoch for all configurations (fold 0). All MuQ variants converge to similar validation SRCC.
  • Figure 2: System-level scatter: predicted MuQ-Eval scores vs. human MOS for 31 TTM models (A1, fold 0). Dashed: linear fit; dotted: $y = x$ reference.
  • Figure 3: Utterance-level scatter: per-clip predicted vs. human MOS (MI dimension, A1, fold 0, $\sim$385 clips). Higher density along the diagonal indicates strong per-sample agreement.
  • Figure 4: Progressive ablation: system-level (solid bars) and utterance-level (hatched bars) SRCC(MI) per experiment. Trend lines connect values across experiments. All MuQ variants cluster tightly at both granularities, confirming that progressive complexity does not yield cumulative improvement.
  • Figure 5: Degradation concordance heatmap (A3b, fold 0). The metric reliably detects signal-level artifacts (MP3, noise) at severe levels but is insensitive to musical-structural distortions (pitch, tempo) at all severities.
  • ...and 1 more figure
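Figures 2 and 3 contrast system-level and utterance-level agreement with human MOS. The following sketch shows one common way to compute both Spearman rank correlations (SRCC). It assumes that system-level SRCC is taken over per-system mean scores, which is standard practice for MOS prediction but is not confirmed by the text above. The toy ratings and all function names are hypothetical.

```python
from collections import defaultdict


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def srcc(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def system_level_srcc(systems, predicted, human):
    """Average scores per generating system, then correlate the means."""
    pred_by, human_by = defaultdict(list), defaultdict(list)
    for name, p, h in zip(systems, predicted, human):
        pred_by[name].append(p)
        human_by[name].append(h)
    names = sorted(pred_by)
    pred_means = [sum(pred_by[n]) / len(pred_by[n]) for n in names]
    human_means = [sum(human_by[n]) / len(human_by[n]) for n in names]
    return srcc(pred_means, human_means)


# Hypothetical toy data: two clips from each of three systems.
systems = ["A", "A", "B", "B", "C", "C"]
human = [2.0, 3.0, 3.5, 2.5, 4.0, 5.0]
predicted = [2.5, 2.0, 3.0, 3.2, 4.1, 4.0]

utterance_srcc = srcc(predicted, human)
system_srcc = system_level_srcc(systems, predicted, human)
print(f"utterance-level SRCC: {utterance_srcc:.3f}")
print(f"system-level SRCC:    {system_srcc:.3f}")
```

On this toy data the system ranking is recovered perfectly even though individual clips are partly misordered. Averaging within a system cancels per-clip noise, which is consistent with the system-level figure in the abstract being higher than the utterance-level one.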