TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios

Zehan Li, Hongjie Chen, Qing Wang, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Xuelong Li

TL;DR

TELEVAL introduces a dynamic, user-centered benchmark for evaluating Chinese spoken language models in natural interactive scenarios. It combines two core aspects—Reliable Content Fulfillment and Interactional Appropriateness—across a large, evolving dataset that includes real and synthetic audio, dialect variation, and paralinguistic cues. The evaluation framework mixes objective text matching, calibrated LLM scoring, and diverse audio metrics, enabling robust cross-model comparisons and exposing gaps in pragmatic, interaction-aware abilities. The findings reveal strong semantic capabilities but limited ability to adapt responses to paralinguistic signals and dialectal nuances, underscoring the need for more interaction-faithful benchmarks in real-world conversations.

Abstract

Spoken language models (SLMs) have advanced rapidly in recent years, accompanied by a growing number of evaluation benchmarks. However, most existing benchmarks emphasize task completion and capability scaling, while remaining poorly aligned with how users interact with SLMs in real-world spoken conversations. Effective spoken interaction requires not only accurate understanding of user intent and content, but also the ability to respond with appropriate interactional strategies. In this paper, we present TELEVAL, a dynamic, user-centered benchmark for evaluating SLMs in realistic Chinese spoken interaction scenarios. TELEVAL consolidates evaluation into two core aspects. Reliable Content Fulfillment assesses whether models can comprehend spoken inputs and produce semantically correct responses. Interactional Appropriateness evaluates whether models act as socially capable interlocutors, requiring them not only to generate human-like, colloquial responses, but also to implicitly incorporate paralinguistic cues for natural interaction. Experiments reveal that, despite strong performance on semantic and knowledge-oriented tasks, current SLMs still struggle to produce natural and interactionally appropriate responses, highlighting the need for more interaction-faithful evaluation.

Paper Structure

This paper contains 24 sections, 3 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Examples from the TELEVAL benchmark. The English transcription is given in the appendix.
  • Figure 2: Overview of TELEVAL. Panels (a) and (b) illustrate the evaluation capabilities and datasets across different aspects and tasks. Panel (c) provides an overview of the evaluation results, which are normalized by the maximum value across SLMs.
  • Figure 3: Values denote score differences relative to each model's LlamaQA-zh baseline, with darker colors indicating larger degradation. The left panel shows the most extreme condition of each acoustic setting, while the right panel shows relative performance degradation across different dialects. Abbreviations: BG = Background Speaker; Reverb = Reverberation; Distortion = Distortion Coefficient; Low-pass = Low-pass Filter.
  • Figure 4: Task formats that are not aligned with interactive scenarios.
  • Figure 5: Examples from TELEVAL, transcribed in English.
  • ...and 2 more figures
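
The captions for Figures 2 and 3 mention two simple score transformations: normalizing results by the maximum value across SLMs, and reporting differences relative to each model's LlamaQA-zh baseline. A minimal sketch of both follows, assuming scores are stored as plain dictionaries. The function names, model names, task names, and numbers are hypothetical and are not taken from the paper.

```python
def normalize_by_max(scores):
    """Divide each task score by the best score any model reached on that task.

    `scores` maps model name -> {task name -> raw score} (hypothetical layout).
    """
    tasks = {task for per_model in scores.values() for task in per_model}
    best = {
        task: max(per_model[task] for per_model in scores.values() if task in per_model)
        for task in tasks
    }
    return {
        model: {
            # Guard against a task where every model scored 0.
            task: (score / best[task] if best[task] else 0.0)
            for task, score in per_model.items()
        }
        for model, per_model in scores.items()
    }


def difference_from_baseline(baseline, condition_scores):
    """Return score minus baseline per condition; negative values mean degradation."""
    return {name: score - baseline for name, score in condition_scores.items()}


# Made-up numbers, for illustration only.
raw = {
    "model_a": {"qa": 80.0, "dialect": 40.0},
    "model_b": {"qa": 60.0, "dialect": 50.0},
}
normalized = normalize_by_max(raw)
deltas = difference_from_baseline(70.0, {"reverb": 65.0, "low_pass": 70.0})
```

Under this scheme the best model on each task gets 1.0, so the tasks share a common scale, while the baseline differences stay in the original score units.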