VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models

Wenqian Cui; Xiaoqi Jiao; Ziqiao Meng; Irwin King

TL;DR

The paper identifies a gap in evaluating the knowledge understanding of end-to-end spoken language models (SLMs) and presents VoxEval, a SpeechQA benchmark that keeps both input and output in speech, probes robustness across diverse audio conditions, and covers spoken mathematics. It constructs 13,938 SpeechQA pairs by converting MMLU questions to spoken form with text-to-speech (TTS), varies the input across speakers, speaking styles, and audio qualities, and adds a two-step spoken-math generation procedure and a chain-of-thought (CoT) reasoning framework. Five open-source SLMs (SpeechGPT, TWIST, SPIRIT-LM, Moshi, GLM-4-Voice) are evaluated; all find the benchmark difficult and are sensitive to input conditions, with CoT sometimes hindering performance. VoxEval thus provides a challenging benchmark for end-to-end SLMs and points to robustness, reasoning, and speech-based knowledge understanding as directions for improvement in real-world scenarios.
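The sensitivity to input conditions mentioned above can be summarized as the largest accuracy gap between any two conditions. The following is a minimal sketch of that idea, not the authors' evaluation code: the function names `accuracy` and `condition_spread` are hypothetical, and the toy outcomes are placeholders rather than results from the paper.

```python
from typing import Dict, List


def accuracy(correct: List[bool]) -> float:
    """Fraction of questions answered correctly (0.0 for an empty list)."""
    return sum(correct) / len(correct) if correct else 0.0


def condition_spread(results: Dict[str, List[bool]]) -> float:
    """Largest accuracy gap between any two input conditions.

    `results` maps a condition name (e.g. a speaker or audio quality)
    to per-question correctness flags for one model.
    """
    if not results:
        raise ValueError("results must contain at least one condition")
    scores = [accuracy(flags) for flags in results.values()]
    return max(scores) - min(scores)


# Toy placeholder outcomes, NOT results from the paper.
toy = {
    "speaker_a": [True, False, True, True],
    "speaker_b": [True, False, False, False],
}
print(condition_spread(toy))  # 0.75 - 0.25 = 0.5
```

A spread near zero would suggest a model answers consistently regardless of how the question is spoken; a large spread would indicate sensitivity to the input condition.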

Abstract

With the rising need for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs' knowledge understanding because they cannot support end-to-end speech evaluation or account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs' knowledge understanding through pure speech interactions. Our benchmark 1) uniquely maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format. Systematic evaluation demonstrates that VoxEval presents significant challenges to current SLMs, revealing their sensitivity to varying audio conditions and highlighting the need to enhance reasoning capabilities in future development. We hope this benchmark can guide the advancement of more sophisticated and reliable SLMs. The VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval

Paper Structure

This paper contains 25 sections, 2 equations, 6 figures, 15 tables, and 1 algorithm.

Introduction
Related Work
Speech Large Language Models
SpeechLLM Evaluation Benchmarks
Knowledge Understanding of LLMs
VoxEval
Data Construction
Various Input Conditions
Different Speakers
Different Speaking Styles
Different Audio Qualities
Handle Math Expressions and More
Experiments
Experimental Setups
Results
...and 10 more sections

Figures (6)

Figure 1: Illustration of the limitations and challenges of SLM knowledge comprehension evaluation.
Figure 2: Box plots to display the maximum performance score differences across different settings.
Figure 3: Wilcoxon signed-rank test results for the four SLMs showing statistically significant differences in the Friedman test.
Figure 4: The prompt for GPT-4o to convert questions with math expressions.
Figure 5: The prompt for GPT-4o to convert answer choices with math expressions.
...and 1 more figure

Theorems & Definitions (2)

Remark 1
Remark 2
