TELEVAL: A Dynamic Benchmark Designed for Spoken Language Models in Chinese Interactive Scenarios
Zehan Li, Hongjie Chen, Qing Wang, Yuxin Zhang, Jing Zhou, Xuening Wang, Hang Lv, Mengjie Du, Yaodong Song, Jie Lian, Jian Kang, Jie Li, Yongxiang Li, Xuelong Li
TL;DR
TELEVAL is a dynamic, user-centered benchmark for evaluating Chinese spoken language models (SLMs) in natural interactive scenarios. It organizes evaluation around two core aspects, Reliable Content Fulfillment and Interactional Appropriateness, using a large, evolving dataset that includes real and synthetic audio, dialect variation, and paralinguistic cues. The evaluation framework combines objective text matching, calibrated LLM scoring, and diverse audio metrics, which enables robust cross-model comparisons and exposes gaps in pragmatic, interaction-aware abilities. The findings show that current models have strong semantic capabilities but limited ability to adapt responses to paralinguistic signals and dialectal nuances, underscoring the need for benchmarks that more faithfully reflect real-world conversations.
Abstract
Spoken language models (SLMs) have advanced rapidly in recent years, accompanied by a growing number of evaluation benchmarks. However, most existing benchmarks emphasize task completion and capability scaling, while remaining poorly aligned with how users interact with SLMs in real-world spoken conversations. Effective spoken interaction requires not only accurate understanding of user intent and content, but also the ability to respond with appropriate interactional strategies. In this paper, we present TELEVAL, a dynamic, user-centered benchmark for evaluating SLMs in realistic Chinese spoken interaction scenarios. TELEVAL consolidates evaluation into two core aspects. Reliable Content Fulfillment assesses whether models can comprehend spoken inputs and produce semantically correct responses. Interactional Appropriateness evaluates whether models act as socially capable interlocutors, requiring them not only to generate human-like, colloquial responses, but also to implicitly incorporate paralinguistic cues for natural interaction. Experiments reveal that, despite strong performance on semantic and knowledge-oriented tasks, current SLMs still struggle to produce natural and interactionally appropriate responses, highlighting the need for more interaction-faithful evaluation.
