SageLM: A Multi-aspect and Explainable Large Language Model for Speech Judgement

Yuan Ge, Junxiang Zhang, Xiaoqian Liu, Bei Li, Xiangnan Ma, Chenglong Wang, Kaiyang Ye, Yangfan Du, Linfeng Zhang, Yuxin Huang, Tong Xiao, Zhengtao Yu, JingBo Zhu

TL;DR

SageLM is an end-to-end, explainable judge for speech-to-speech dialogue evaluation that addresses the limitations of cascaded ASR-based pipelines and human evaluation. It learns from SpeechFeedback, a large multi-aspect dataset that covers semantic and acoustic judgments, and uses a two-stage training regime with rationale-augmented supervision to align judgments with explanations. SageLM reaches 82.79% agreement with human judgments, surpassing cascaded and SLM baselines by at least 7.42% and 26.20%, respectively, and demonstrates robust generalization and explainability. The work provides a scalable framework and a rich dataset for evaluating S2S LLMs, with clear directions for extending to multi-turn, multilingual, and full-duplex interactions.

Abstract

Speech-to-Speech (S2S) Large Language Models (LLMs) are foundational to natural human-computer interaction, enabling end-to-end spoken dialogue systems. However, evaluating these models remains a fundamental challenge. We propose SageLM, an end-to-end, multi-aspect, and explainable speech LLM for comprehensive evaluation of S2S LLMs. First, unlike cascaded approaches that disregard acoustic features, SageLM jointly assesses both semantic and acoustic dimensions. Second, it leverages rationale-based supervision to enhance explainability and guide model learning, achieving superior alignment with evaluation outcomes compared to rule-based reinforcement learning methods. Third, we introduce SpeechFeedback, a synthetic preference dataset, and employ a two-stage training paradigm to mitigate the scarcity of speech preference data. Trained on both semantic and acoustic dimensions, SageLM achieves an 82.79% agreement rate with human evaluators, outperforming cascaded and SLM-based baselines by at least 7.42% and 26.20%, respectively.
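The headline number above is an agreement rate between the judge model and human evaluators. The abstract does not spell out how that rate is computed, so the following is only a minimal sketch under the assumption that it is the fraction of response pairs on which the judge's preference label matches the human label. The function name `agreement_rate` and the label values ("A", "B", "tie") are hypothetical and chosen for illustration.

```python
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge's label equals the human label.

    Assumption: agreement is a simple exact match on preference labels;
    the paper may define it differently.
    """
    if len(judge) != len(human):
        raise ValueError("label lists must have the same length")
    if not judge:
        raise ValueError("need at least one labelled pair")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical labels for five response pairs (not data from the paper).
judge_labels = ["A", "B", "A", "tie", "B"]
human_labels = ["A", "B", "B", "tie", "B"]

rate = agreement_rate(judge_labels, human_labels)
print(f"agreement: {rate:.2%}")
```

With these made-up labels the judge matches the human on four of five pairs. A per-aspect breakdown would apply the same function separately to the labels for each evaluation aspect.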

Paper Structure

This paper contains 55 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Recent evaluation methods for speech-to-speech LLMs rely on human annotations or cascaded pipelines. We propose SageLM, an end-to-end speech dialogue evaluator that provides explainable judgment results across five aspects, including both semantic and acoustic dimensions.
  • Figure 2: Data construction pipeline of SpeechFeedback.
  • Figure 3: Preliminary study: Reinforcement Learning versus Supervised Fine-Tuning on three evaluation metrics as training data scales up (4k to 24k × 4 aspects).
  • Figure 4: Analysis of the impact of stage 1 semantic evaluation training and stage 2 acoustic evaluation training.
  • Figure 5: Agreement vs. combined length of response pairs.
  • ...and 2 more figures