From Scores to Preferences: Redefining MOS Benchmarking for Speech Quality Reward Modeling

Yifei Cao, Changhao Jiang, Jiabao Zhuang, Jiajun Sun, Ming Zhang, Zhiheng Xi, Hui Li, Shihan Dou, Yuran Wang, Yunke Zhang, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

This work introduces MOS-RMBench, the first unified benchmark that reformulates diverse MOS datasets into a preference-based framework for robust cross-dataset evaluation of speech quality reward models. It systematically compares scalar, semi-scalar, and generative reward modeling paradigms, finding scalar models generally strongest while revealing a domain gap between human and synthetic speech. To tackle fine-grained discrimination, it proposes a MOS-aware GRM that incorporates a MOS-difference-based reward, yielding improved performance on challenging pairs. The benchmark and methodology provide a reproducible foundation for advancing automatic speech quality assessment and preference alignment in speech generation systems.

Abstract

Assessing the perceptual quality of synthetic speech is crucial for guiding the development and refinement of speech generation models. However, this assessment has traditionally relied on human subjective ratings such as the Mean Opinion Score (MOS), which depend on manual annotations and often suffer from inconsistent rating standards and poor reproducibility. To address these limitations, we introduce MOS-RMBench, a unified benchmark that reformulates diverse MOS datasets into a preference-comparison setting, enabling rigorous evaluation across different datasets. Building on MOS-RMBench, we systematically construct and evaluate three paradigms for reward modeling: scalar reward models, semi-scalar reward models, and generative reward models (GRMs). Our experiments reveal three key findings: (1) scalar models achieve the strongest overall performance, consistently exceeding 74% accuracy; (2) most models perform considerably worse on synthetic speech than on human speech; and (3) all models struggle on pairs with very small MOS differences. To improve performance on these challenging pairs, we propose a MOS-aware GRM that incorporates a MOS-difference-based reward function, enabling the model to adaptively scale rewards according to the difficulty of each sample pair. Experimental results show that the MOS-aware GRM significantly improves fine-grained quality discrimination and narrows the gap with scalar models on the most challenging cases. We hope this work will establish both a benchmark and a methodological framework to foster more rigorous and scalable research in automatic speech quality assessment.
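The preference-comparison setting and the idea of a difficulty-scaled reward can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Pair`, `build_pairs`, `pairwise_accuracy`, and `mos_aware_reward`, the `min_gap` and `max_gap` parameters, and the linear form of the reward are all assumptions made for illustration. In particular, the abstract only states that rewards are scaled by pair difficulty; weighting small-gap pairs more heavily is one plausible reading, not a confirmed detail.

```python
from itertools import combinations
from typing import Callable, List, NamedTuple, Tuple


class Pair(NamedTuple):
    chosen: str       # utterance with the higher MOS
    rejected: str     # utterance with the lower MOS
    delta_mos: float  # absolute MOS gap between the two


def build_pairs(items: List[Tuple[str, float]], min_gap: float = 0.0) -> List[Pair]:
    """Turn (utterance_id, mos) records from ONE dataset into preference pairs.

    Pairs whose gap is <= min_gap are skipped, since a tie gives no preference.
    """
    pairs = []
    for (id_a, mos_a), (id_b, mos_b) in combinations(items, 2):
        gap = abs(mos_a - mos_b)
        if gap <= min_gap:
            continue
        chosen, rejected = (id_a, id_b) if mos_a > mos_b else (id_b, id_a)
        pairs.append(Pair(chosen, rejected, gap))
    return pairs


def pairwise_accuracy(pairs: List[Pair], score: Callable[[str], float]) -> float:
    """Fraction of pairs where the model scores the chosen item higher."""
    if not pairs:
        return 0.0
    hits = sum(score(p.chosen) > score(p.rejected) for p in pairs)
    return hits / len(pairs)


def mos_aware_reward(correct: bool, delta_mos: float, max_gap: float = 4.0) -> float:
    """Hypothetical difficulty-scaled reward (assumed form, not from the paper).

    A correct judgment earns between 1.0 (easy, large gap) and 2.0 (hard,
    gap near zero); an incorrect judgment earns 0.0.
    """
    difficulty = 1.0 - min(delta_mos, max_gap) / max_gap
    return 1.0 + difficulty if correct else 0.0
```

Pairs are built within a single dataset here because MOS scales from different listening tests are not directly comparable; whether the benchmark applies the same restriction is likewise an assumption of this sketch.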

Paper Structure

This paper contains 20 sections, 3 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Overview of MOS-RMBench. Data from multiple MOS datasets are filtered and grouped, then converted into pairwise comparisons with natural-language critiques generated by Gemini-2.5-Pro. The resulting dataset supports training and evaluation of reward models in a consistent and reproducible setting.
  • Figure 2: Distribution of MOS scores in MOS-RMBench. Figures (a) and (b) show the MOS distributions of chosen and rejected samples. Figure (c) presents the distribution of $\Delta MOS$ in the test set.
  • Figure 3: Percentile-based error analysis across datasets: error rates are highest for pairs with small MOS differences and decline markedly as the MOS gap widens.
  • Figure 4: Performance comparison of MOS-aware GRMs and standard GRMs trained with different reinforcement methods on samples with MOS difference $\le 0.5$.
  • Figure 5: Prompt structure for Gemini-2.5-Pro to annotate a single-audio critique.
  • ...and 4 more figures
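
A percentile-based error analysis of the kind summarized in the Figure 3 caption could be sketched as follows. This is an illustrative sketch only: the function name `error_rate_by_percentile`, the `(delta_mos, is_correct)` record layout, and the equal-size bucketing are assumptions, and the paper's exact binning is not given in this excerpt.

```python
from typing import List, Optional, Tuple


def error_rate_by_percentile(
    records: List[Tuple[float, bool]], n_bins: int = 4
) -> List[Optional[float]]:
    """Error rate per percentile bucket of the MOS gap.

    records: (delta_mos, is_correct) for each evaluated pair.
    Pairs are sorted by delta_mos and split into n_bins buckets of
    (nearly) equal size; an empty bucket yields None.
    """
    ordered = sorted(records, key=lambda r: r[0])
    n = len(ordered)
    rates: List[Optional[float]] = []
    for b in range(n_bins):
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        bucket = ordered[lo:hi]
        if not bucket:
            rates.append(None)
            continue
        errors = sum(1 for _, ok in bucket if not ok)
        rates.append(errors / len(bucket))
    return rates
```

With toy records, the lowest-gap buckets come first in the returned list, so a declining sequence of rates corresponds to the trend the caption describes.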