VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models
Wenqian Cui, Xiaoqi Jiao, Ziqiao Meng, Irwin King
TL;DR
The paper identifies a gap in evaluating the knowledge understanding of end-to-end spoken language models (SLMs) and presents VoxEval, a SpeechQA benchmark that keeps both inputs and outputs in speech and probes robustness across diverse audio conditions, including spoken mathematics. It constructs 13,938 SpeechQA pairs by converting MMLU questions to spoken form with TTS, varies the input speaker, speaking style, and audio quality, and adds a two-step procedure for generating spoken math questions together with a chain-of-thought (CoT) reasoning framework. Five open-source SLMs (SpeechGPT, TWIST, SPIRIT-LM, Moshi, GLM-4-Voice) are evaluated; all of them struggle on the benchmark and are sensitive to input conditions, and CoT sometimes hinders performance. VoxEval thus serves as a challenging benchmark for end-to-end SLMs and points to robustness, reasoning, and speech-based knowledge understanding in real-world scenarios as directions for improvement.
Abstract
With the rising need for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs' knowledge understanding because they neither support end-to-end speech evaluation nor account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs' knowledge understanding through pure speech interactions. Our benchmark 1) uniquely maintains the speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks such as mathematical reasoning in spoken format. Systematic evaluation demonstrates that VoxEval presents significant challenges to current SLMs, revealing their sensitivity to varying audio conditions and highlighting the need to enhance reasoning capabilities in future development. We hope this benchmark can guide the advancement of more sophisticated and reliable SLMs. The VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval
