GSRM: Generative Speech Reward Model for Speech RLHF

Maohao Shen, Tejas Jayashankar, Osama Hanna, Naoyuki Kanda, Yancheng Wang, Kateřina Žmolíková, Ruiming Xie, Niko Moritz, Anfeng Xu, Yashesh Gaur, Gregory Wornell, Qing He, Jilong Wu

TL;DR

Inspired by recent advances in generative reward modeling, this paper proposes GSRM, a reasoning-centric reward model tailored for speech. GSRM substantially outperforms existing speech naturalness predictors and can improve the naturalness of speech LLM generations by serving as an effective verifier for online RLHF.

Abstract

Recent advances in speech language models, such as GPT-4o Voice Mode and Gemini Live, have demonstrated promising speech generation capabilities. Nevertheless, the aesthetic naturalness of the synthesized audio still lags behind that of human speech. Enhancing generation quality requires a reliable evaluator of speech naturalness. However, existing naturalness evaluators typically regress raw audio to scalar scores, which offers limited interpretability, and they fail to generalize to speech across different taxonomies. Inspired by recent advances in generative reward modeling, we propose the Generative Speech Reward Model (GSRM), a reasoning-centric reward model tailored for speech. GSRM is trained to decompose speech naturalness evaluation into an interpretable acoustic feature extraction stage followed by feature-grounded chain-of-thought reasoning, enabling explainable judgments. To achieve this, we curated a large-scale human feedback dataset comprising 31k expert ratings and an out-of-domain benchmark of real-world user-assistant speech interactions. Experiments show that GSRM substantially outperforms existing speech naturalness predictors, achieving a model-human correlation on naturalness score prediction that approaches human inter-rater consistency. We further show how GSRM can improve the naturalness of speech LLM generations by serving as an effective verifier for online RLHF.
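
The model-human correlation mentioned in the abstract can be illustrated with the Pearson correlation coefficient (PCC), the metric named in the caption of Figure 4. The following is a minimal sketch, not the authors' evaluation code: the function name `pearson_correlation` and the toy scores are hypothetical, and it assumes one predicted score and one human score per utterance.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Return the Pearson correlation coefficient between two score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: model-predicted vs. human naturalness scores.
model_scores = [3.5, 4.0, 2.0, 4.5]
human_scores = [3.0, 4.5, 2.5, 5.0]
print(pearson_correlation(model_scores, human_scores))
```

Under the same assumption, the inter-rater consistency that the abstract uses as a reference point could be estimated by applying the same function to the scores of two human raters.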

Paper Structure

This paper contains 34 sections, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: Comparison of Naturalness Predictors. Generative Speech Reward Model (GSRM) integrates explicit acoustic feature extraction with feature-grounded chain-of-thought reasoning. Unlike other approaches, GSRM produces an interpretable evidence log derived from raw audio before reasoning or judging final numerical ratings.
  • Figure 2: CoT Synthesis Framework of GSRM. For each utterance segment, the synthesis pipeline first extracts structured, vowel-level acoustic features and then synthesizes detailed reasoning that connects paralinguistic cues with perceptual judgments. This process is applied across all utterance segments to construct a complete evidence log for the audio, which is subsequently combined with a global judgment CoT to form the final training trajectory for GSRM.
  • Figure 3: GSRM for Speech Online RLHF. Given a user query, a speech LLM acting as the generator produces a spoken response. Both the generated speech and its text transcript are provided to GSRM, which outputs ratings across multiple acoustic and semantic dimensions. A reward aggregator combines these ratings into a scalar reward that is used as feedback to guide online reinforcement learning updates for the generator.
  • Figure 4: Scaling Behavior of GSRM. The y-axis reports PCC performance on the FDX-Conv OOD set.
  • Figure 5: Ablation Studies on the Impact of Sub-metrics and Acoustic Features on Naturalness Prediction.
  • ...and 5 more figures
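
The reward aggregator described in the caption of Figure 3 can be sketched as a weighted average of per-dimension ratings, rescaled to a scalar reward. This is only an illustration under stated assumptions: the captions do not say how the paper aggregates ratings, so the weighted mean, the 1-to-5 rating scale, the dimension names, and the function name `aggregate_reward` are all hypothetical.

```python
from typing import Dict, Optional


def aggregate_reward(
    ratings: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
    min_score: float = 1.0,  # assumed lower bound of the rating scale
    max_score: float = 5.0,  # assumed upper bound of the rating scale
) -> float:
    """Combine per-dimension ratings into one scalar reward in [0, 1]."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    if weights is None:
        # Default assumption: every dimension counts equally.
        weights = {name: 1.0 for name in ratings}
    total_weight = sum(weights[name] for name in ratings)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_mean = sum(weights[name] * ratings[name] for name in ratings) / total_weight
    # Rescale from [min_score, max_score] to [0, 1] for use as an RL reward.
    return (weighted_mean - min_score) / (max_score - min_score)


# Hypothetical ratings for one generated response.
ratings = {"naturalness": 5.0, "prosody": 3.0}
print(aggregate_reward(ratings))
print(aggregate_reward(ratings, weights={"naturalness": 3.0, "prosody": 1.0}))
```

In an online RLHF loop, a scalar of this kind would be computed for each sampled spoken response and passed to the policy update for the generator. The choice of weights determines how acoustic and semantic dimensions trade off against each other.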