WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models
Yangzhuo Li, Shengpeng Ji, Yifu Chen, Tianle Liang, Haorong Ying, Yule Wang, Junbo Li, Jun Fang, Zhou Zhao
TL;DR
WavBench is a holistic benchmark for end-to-end spoken dialogue models that jointly evaluates advanced reasoning, spoken-colloquial delivery, and paralinguistic fidelity. It defines a tripartite framework (Pro Colloquial Expression, Basic Colloquial Expression, and Acoustic Interaction) encompassing 17,577 items across 76.5 hours, and provides two generation pipelines (Colloquial Expression Set and Acoustic Interaction Set) with multi-stage data creation, verification, and synthesis. The benchmark uses a Gemini-based evaluation protocol to score colloquial naturalness, and a combination of expert annotations and ground-truth paralinguistic labels to score the acoustic tasks, which cover explicit understanding, explicit generation, and implicit dialogue. Experimental results on five end-to-end models show GPT-4o Audio leading overall but reveal substantial gaps in cognitive-acoustic alignment and multi-turn paralinguistic consistency, underscoring the need for more robust, reasoning-enabled, acoustically faithful models in real-world spoken dialogue systems.
Abstract
With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.
