SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

Xueyao Zhang, Chaoren Wang, Huan Liao, Ziniu Li, Yuancheng Wang, Li Wang, Dongya Jia, Yuanzhe Chen, Xiulin Li, Zhuo Chen, Zhizheng Wu

TL;DR

SpeechJudge tackles the challenge of aligning speech synthesis with human perception of naturalness by providing a large-scale human-preference dataset (SpeechJudge-Data), a dedicated evaluation benchmark (SpeechJudge-Eval), and a generative reward model (SpeechJudge-GRM) trained with a two-stage SFT+RL workflow. The dataset spans multiple zero-shot TTS models, languages, and expressive styles, with human annotations for both intelligibility and naturalness preference. On SpeechJudge-Eval, traditional metrics and AudioLLMs lag behind human judgments, while SpeechJudge-GRM achieves 77.2% accuracy (79.4% with inference-time voting), surpassing a classic Bradley-Terry baseline (72.7%). The work further demonstrates how GRMs can serve as reward functions for post-training and sample selection, and it releases data, code, and models to support future human-aligned improvements in speech naturalness. Limitations include language bias and expressive-speech challenges, pointing to avenues for broader multilingual data and more fine-grained, style-aware evaluation.

Abstract

Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models across diverse speech styles and multiple languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting significant room for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (and 79.4% after inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
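
The accuracy figures above are agreement rates with human preference labels, and the TL;DR describes the "@10" setting as inference-time voting. The following is a minimal sketch of both ideas, assuming each judgment is a label "A" or "B" and that voting means a simple majority over the sampled judgments. The function names and the tie-breaking rule are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from typing import Sequence


def majority_vote(judgments: Sequence[str]) -> str:
    """Return the majority label among sampled judgments ("A" or "B").

    Assumption: ties are broken in favour of "A". The source does not
    say how ties are handled.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    return "A" if counts["A"] >= counts["B"] else "B"


def agreement(predictions: Sequence[str], human_labels: Sequence[str]) -> float:
    """Fraction of pairs where the predicted preference matches the human one."""
    if not human_labels or len(predictions) != len(human_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(p == h for p, h in zip(predictions, human_labels))
    return matches / len(human_labels)
```

With ten sampled judgments per pair, `majority_vote` would collapse them into one prediction, and `agreement` would then score those predictions against the human labels.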

Paper Structure

This paper contains 34 sections, 3 equations, 12 figures, 14 tables.

Figures (12)

  • Figure 1: SpeechJudge-Data consists of speech pairs (with corresponding text) synthesized by multiple zero-shot TTS models. For each pair, human annotators need to perform (a) a pointwise annotation of text accuracy to assess intelligibility, and (b) a pairwise preference annotation to judge the relative speech naturalness.
  • Figure 2: Distribution of SpeechJudge-Data.
  • Figure 3: Distribution of SpeechJudge-Data on different levels of human agreement.
  • Figure 4: SpeechJudge-GRM: (a) We employ Gemini-2.5-Flash as a teacher model to generate CoT rationales for SpeechJudge-Data. We use the samples where Gemini-2.5-Flash's preference aligns with human as the SFT dataset, while the remaining samples are reserved for the RL stage. (b) We treat the human preference as a verifiable reward to train the GRM with GRPO.
  • Figure 5: Subjective evaluation of using SpeechJudge-GRM for high-naturalness sample selection. Human subjects compare a best-of-100 output of Qwen2.5-Omni-7B (Talker), chosen by either SpeechJudge-BTRM or SpeechJudge-GRM, against a randomly selected output.
  • ...and 7 more figures
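
The data routing described in the Figure 4 caption can be sketched as follows: pairs where the teacher's preference matches the human preference go to SFT, and the rest are held back for RL, where the human preference acts as the reward signal. This is a rough illustration only. The `Pair` record, its field names, and the binary 1.0/0.0 reward are assumptions made for the example; the caption states only that the human preference is treated as a verifiable reward.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Pair:
    """Hypothetical record for one annotated speech pair."""
    pair_id: str
    human_pref: str    # human naturalness preference: "A" or "B"
    teacher_pref: str  # preference stated in the teacher's CoT rationale


def split_for_training(pairs: Sequence[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Route pairs to SFT when teacher and human agree, otherwise to RL."""
    sft = [p for p in pairs if p.teacher_pref == p.human_pref]
    rl = [p for p in pairs if p.teacher_pref != p.human_pref]
    return sft, rl


def verifiable_reward(model_pref: str, human_pref: str) -> float:
    """Assumed binary reward: 1.0 if the model agrees with the human, else 0.0."""
    return 1.0 if model_pref == human_pref else 0.0
```

Under this split, the RL pool consists of the pairs the teacher got wrong, which is consistent with the abstract's description of running GRPO "on challenging cases".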