
Modeling and Benchmarking Spoken Dialogue Rewards with Modality and Colloquialness

Jingyu Lu, Yuhan Wang, Fan Zhuo, Xize Cheng, Changhao Pan, Xueyi Pu, Yifu Chen, Chenyuhao Wen, Tianle Liang, Zhou Zhao

Abstract

The rapid evolution of end-to-end spoken dialogue systems demands transcending mere textual semantics to incorporate paralinguistic nuances and the spontaneous nature of human conversation. However, current methods struggle with two critical gaps: the modality gap, involving prosody and emotion, and the colloquialness gap, distinguishing written scripts from natural speech. To address these challenges, we introduce SDiaReward, an end-to-end multi-turn reward model trained on SDiaReward-Dataset, a novel collection of episode-level preference pairs explicitly targeting these gaps. It operates directly on full multi-turn speech episodes and is optimized with pairwise preference supervision, enabling joint assessment of modality and colloquialness in a single evaluator. We further establish ESDR-Bench, a stratified benchmark for robust episode-level evaluation. Experiments demonstrate that SDiaReward achieves state-of-the-art pairwise preference accuracy, significantly outperforming general-purpose audio LLMs. Further analysis suggests that SDiaReward captures relative conversational expressiveness beyond superficial synthesis cues, improving generalization across domains and recording conditions. Code, data, and demos are available at https://sdiareward.github.io/.
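The abstract states only that the model is optimized with pairwise preference supervision and evaluated by pairwise preference accuracy; it does not give the objective. As a rough illustration, the sketch below assumes a standard Bradley-Terry style loss, $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$, over scalar episode-level rewards. The function names and the scores are hypothetical and are not taken from the paper or its released code.

```python
import math


def pairwise_preference_loss(chosen_scores, rejected_scores):
    """Mean Bradley-Terry loss, -log(sigmoid(r_chosen - r_rejected)).

    Assumption: each episode has already been mapped to one scalar reward.
    """
    losses = []
    for r_c, r_r in zip(chosen_scores, rejected_scores):
        margin = r_c - r_r
        # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
        losses.append(max(-margin, 0.0) + math.log1p(math.exp(-abs(margin))))
    return sum(losses) / len(losses)


def pairwise_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs in which the chosen episode scores strictly higher."""
    correct = sum(1 for r_c, r_r in zip(chosen_scores, rejected_scores) if r_c > r_r)
    return correct / len(chosen_scores)


# Hypothetical rewards for three preference pairs (illustrative values only).
chosen = [1.0, 0.2, 0.5]
rejected = [0.0, 0.4, 0.1]
loss = pairwise_preference_loss(chosen, rejected)
accuracy = pairwise_accuracy(chosen, rejected)
```

Because the loss depends only on the difference between the two rewards, it constrains relative rather than absolute scores, which is consistent with the abstract's emphasis on relative conversational expressiveness.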

Paper Structure (58 sections, 4 equations, 13 figures, 6 tables)


Figures (13)

  • Figure 1: Challenges in spoken dialogue and our proposed framework. Text-based systems face modality (prosody/emotion) and colloquialness (style) gaps. Unlike rule-based methods, our end-to-end Reward Model learns these features from multi-turn dialogues via data-driven preference signals.
  • Figure 2: Overview of dataset construction. (a) Collection: We collect wild conversational audio (main) along with semi-wild/scripted data. (b1--b2) Processing & Pairing: We process audio into speaker-aware turns and group them into dialogues. We then construct two types of pairs: modality-aware pairs (center) via real vs. TTS audio, and colloquialness pairs (bottom right) via text-style vs. spoken-style generation and style change. (c1--c2) Post-processing: We filter episodes and attach hierarchical metadata (emotion, sentiment, act) for benchmark stratification. The detailed data processing pipeline can be found in Appendix \ref{sec:appendix_dataset}.
  • Figure 3: Architecture of our reward model.
  • Figure 4: Ablation Analysis on SDiaReward Model (7B). (a) Score Alignment: The proposed center loss (Orange) effectively anchors the chosen reward distribution to $\mu \approx 0.32$, whereas the baseline (Blue) suffers from significant drift ($\mu > 5.0$). (b) Margin Stability: The discriminative margin remains robust. (c) Density Modes: Split violin plots visualize reward density, showing high confidence in Wild data. (d) Statistical Ranges: Box plots reveal domain-dependent decision boundaries; notably, Scripted responses receive lower absolute scores despite being correct choices.
  • Figure 5: Overview of ESDR-Bench.
  • ...and 8 more figures