WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

Yangzhuo Li; Shengpeng Ji; Yifu Chen; Tianle Liang; Haorong Ying; Yule Wang; Junbo Li; Jun Fang; Zhou Zhao

TL;DR

WavBench is a holistic benchmark for end-to-end spoken dialogue models that jointly evaluates advanced reasoning, colloquial spoken delivery, and paralinguistic fidelity. It defines a tripartite framework (Pro Colloquial Expression, Basic Colloquial Expression, and Acoustic Interaction) comprising 17,577 items totaling 76.5 hours, and provides two generation pipelines (one for the Colloquial Expression Set and one for the Acoustic Interaction Set), each with multi-stage data creation, verification, and synthesis. Colloquial naturalness is scored with a Gemini-based evaluation protocol, while the acoustic tasks (explicit understanding, explicit generation, and implicit dialogue) are scored with a combination of expert annotations and ground-truth paralinguistic labels. Experimental results on five end-to-end models show GPT-4o Audio leading overall but reveal substantial gaps in cognitive-acoustic alignment and multi-turn paralinguistic consistency, underscoring the need for more robust, reasoning-enabled, acoustically faithful models in real-world spoken dialogue systems.
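
As a rough sense of scale, the reported totals imply an average duration per item. The sketch below is only back-of-the-envelope arithmetic: it assumes the 76.5 hours covers all 17,577 items, and the function name `mean_seconds_per_item` is hypothetical, not part of the WavBench toolkit.

```python
# Back-of-the-envelope check on the dataset size reported in the summary.
# Assumption: the 76.5 hours is the total duration across all 17,577 items.
TOTAL_ITEMS = 17_577
TOTAL_HOURS = 76.5


def mean_seconds_per_item(total_hours: float, total_items: int) -> float:
    """Convert a total duration in hours to average seconds per item."""
    return total_hours * 3600.0 / total_items


avg = mean_seconds_per_item(TOTAL_HOURS, TOTAL_ITEMS)
print(f"about {avg:.1f} s per item on average")  # about 15.7 s per item on average
```

This is an overall mean only; the summary does not say how duration is distributed across the Pro, Basic, and Acoustic subsets.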

Abstract

With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at https://naruto-2024.github.io/wavbench.github.io/.

Paper Structure

This paper contains 21 sections, 10 figures, and 2 tables.

Introduction
Related work
Spoken Dialogue System
Spoken Language Benchmark
WavBench
Overview
Data Statistics
Colloquial Expression Set Generation Pipeline
Acoustic Interaction Set Generation Pipeline
Benchmark for End-to-End Spoken Dialogue Models
Task Definition
Evaluation Metrics
Experimental Results
Colloquial Expression Pro
Colloquial Expression Basic
...and 6 more sections

Figures (10)

Figure 1: Overview of WavBench results comparing five end-to-end spoken dialogue models across colloquial expression (Basic/Pro), explicit instruction understanding/generation, and implicit dialogue.
Figure 2: The emotional quotient gap between cascaded and end-to-end spoken dialogue models is primarily reflected in their ability to understand and generate paralinguistic features.
Figure 3: Examples of Acoustic Interaction in WavBench.
Figure 4: Examples of Colloquial Expression in WavBench.
Figure 5: Visualization of static analysis of the Colloquial Expression Set in WavBench.
...and 5 more figures
