VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models

Heyang Liu, Yuhao Wang, Ziyang Cheng, Hongcheng Liu, Yiqi Li, Yixuan Hou, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang

TL;DR

VocalBench addresses the lack of comprehensive, multilingual evaluation for speech interaction models by introducing a benchmark of roughly 24k English and Mandarin instances spanning 14 capability dimensions. It evaluates 27 mainstream models, including cascade pipelines, offline SpeechLLMs, Omni-LLMs, and real-time APIs, with a metric suite covering semantic accuracy, acoustic naturalness, dialogue capability, safety, and robustness. Key findings: backbone scaling and MoE architectures boost semantic performance, but acoustic expressiveness and empathetic speech improve more slowly, and end-to-end models are generally more robust than cascaded approaches. The framework supports fine-grained diagnosis of trade-offs and points to future work on speech token-to-waveform generation, paralinguistic control, and cross-lingual capabilities, with practical implications for deploying voice-enabled systems in real-world settings.

Abstract

Speech large language models (SpeechLLMs) have extended human-machine interaction from the text modality to the dynamic speech domain. Spoken dialogues convey diverse information, including semantic concepts, acoustic variations, paralinguistic cues, and environmental context. However, existing evaluations of speech interaction models lack instances that mimic real scenarios and focus predominantly on isolated aspects of performance, without a comprehensive comparison of critical capabilities across current approaches. To address this gap, we propose VocalBench to assess speech conversational abilities. It comprises around 24k carefully curated instances in English and Mandarin across four key dimensions (semantic quality, acoustic performance, conversational abilities, and robustness), covering 14 user-oriented characteristics. Experiments on 27 mainstream models reveal the common challenges facing current approaches and highlight the need for new insights into next-generation speech interaction systems.

Paper Structure

This paper contains 47 sections, 2 equations, 9 figures, 32 tables.

Figures (9)

  • Figure 1: Core capabilities of ideal speech interaction models, which are included in VocalBench.
  • Figure 2: The creation pipeline for VocalBench.
  • Figure 3: Dataset statistics for VocalBench. a. Public resources of VocalBench (without quantity information). b. The evaluation sets and proportions of VocalBench-en. c. The evaluation sets and proportions of VocalBench-zh.
  • Figure 4: The performance of representative models on VocalBench. left: VocalBench-en; right: VocalBench-zh.
  • Figure 5: The robustness performance on VocalBench-en. The dotted lines represent the scores in clean conditions.
  • ...and 4 more figures