Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap

Yueqian Lin, Zhengmian Hu, Qinsi Wang, Yudong Liu, Hengfan Zhang, Jayakumar Subramanian, Nikos Vlassis, Hai Helen Li, Yiran Chen

TL;DR

The paper formalizes the Voice Reasoning Gap (VRG) and introduces VERA, a reproducible benchmark that evaluates reasoning in voice-interactive systems under real-time constraints across five tracks. It demonstrates a persistent gap between text-based reasoning and voice-based reasoning, with an average accuracy drop of 40.4 percentage points and larger declines on multi-step tasks (e.g., Math gap = 68.7 points). Through controlled experiments, including decoupled LiveAnswer baselines and extended thinking time, the authors show that the gap persists despite architectural variations and improved audio fidelity, indicating a fundamental tension between streaming speech and iterative reasoning. The work also characterizes distinct failure signatures by architecture, providing a diagnostic framework to guide future innovations toward asynchronous or chunked reasoning paradigms. VERA offers a principled, cross-architecture evaluation tool to measure progress toward real-time voice assistants that are both fluent and reliably reasoned, with broad implications for designing next-generation conversational systems.

Abstract

We present Voice Evaluation of Reasoning Ability (VERA), a benchmark for evaluating reasoning ability in voice-interactive systems under real-time conversational constraints. VERA comprises 2,931 voice-native episodes derived from established text benchmarks and organized into five tracks (Math, Web, Science, Long-Context, Factual). Each item is adapted for speech interaction while preserving reasoning difficulty. VERA enables direct text-voice comparison within model families and supports analysis of how architectural choices affect reliability. We assess 12 contemporary voice systems alongside strong text baselines and observe large, consistent modality gaps: on competition mathematics a leading text model attains 74.8% accuracy while its voice counterpart reaches 6.1%; macro-averaged across tracks the best text models achieve 54.0% versus 11.3% for voice. Latency-accuracy analyses reveal a low-latency plateau, where fast voice systems cluster around ~10% accuracy, while approaching text performance requires sacrificing real-time interaction. Diagnostic experiments indicate that common mitigations are insufficient. Increasing "thinking time" yields negligible gains; a decoupled cascade that separates reasoning from narration improves accuracy but still falls well short of text and introduces characteristic grounding/consistency errors. Failure analyses further show distinct error signatures across native streaming, end-to-end, and cascade designs. VERA provides a reproducible testbed and targeted diagnostics for architectures that decouple thinking from speaking, offering a principled way to measure progress toward real-time voice assistants that are both fluent and reliably reasoned.
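
The gaps quoted above are plain differences in accuracy, measured in percentage points. The following is a minimal sketch, assuming the gap is defined as text accuracy minus voice accuracy (which is consistent with the 68.7-point Math gap in the TL;DR). The function name `modality_gap` is hypothetical and does not come from the paper.

```python
def modality_gap(text_acc: float, voice_acc: float) -> float:
    """Return text accuracy minus voice accuracy, in percentage points.

    Assumption: the gap is a simple difference of the two accuracies.
    """
    # Round to one decimal to match the precision of the reported figures.
    return round(text_acc - voice_acc, 1)


# Accuracies quoted in the abstract (percent).
math_gap = modality_gap(74.8, 6.1)    # competition mathematics
macro_gap = modality_gap(54.0, 11.3)  # macro-averaged across tracks

print(f"Math gap:  {math_gap} points")
print(f"Macro gap: {macro_gap} points")
```

The second value compares the macro-averaged text and voice scores given in the abstract. It is a different quantity from the 40.4-point average drop cited in the TL;DR, whose exact averaging is not specified here.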

Paper Structure

This paper contains 42 sections, 2 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Latency–accuracy frontier on VERA. Markers show model performance (black circles: text, blue triangles: voice, purple square: LiveAnswer cascade) with x-axis as first response time (log scale) and y-axis as accuracy. The green Pareto frontier reveals a real-time reasoning desert: models achieving $\leq\!1.5$s response time (shaded band) plateau around 10% accuracy, while approaching the text upper bound ($\sim$54%, dashed line) requires sacrificing real-time interaction.
  • Figure 2: VERA at a glance. Five representative panels (Math, Web, Science, Long-Context, Factual) show how items are rewritten for voice while preserving reasoning difficulty.
  • Figure 3: Benchmark Construction Pipeline. From brainstorming to final audio generation through systematic filtering and quality control.
  • Figure 4: Modality patterns across model families. (a)-(b) Radar charts comparing text vs voice models within GPT and Gemini families across five tracks. (c) Horizontal bars showing Qwen voice model accuracy by track, with 10% and 20% reference lines.
  • Figure 5: LiveAnswer cascade latency. Stacked bars show STT (hatched) and LLM+TTS stages. Diamond marks user-perceived time to first audio. Mean latencies: $T_{\text{STT}}$=9.68s for speech recognition, $T_{\text{TTFR}_\text{partial}}$=0.83s from STT completion to first audio output, $T_{\text{LLM+TTS}}$=63.40s for complete reasoning and synthesis. Total end-to-end: $T_{\text{STT}}$ + $T_{\text{TTFR}_\text{partial}}$ + remaining generation.
  • ...and 1 more figure
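
The mean latencies in the Figure 5 caption are enough to estimate the user-perceived delay of the LiveAnswer cascade. The following is a rough sketch, assuming that time to first audio is the STT stage plus the interval from STT completion to first audio, and that the 1.5 s band from Figure 1 is the relevant real-time threshold. The variable names are illustrative only.

```python
# Mean latencies from the Figure 5 caption, in seconds.
T_STT = 9.68           # speech recognition
T_TTFR_PARTIAL = 0.83  # STT completion to first audio output

# Real-time band from the Figure 1 caption, in seconds.
# Assumption: Figure 1's "first response time" is comparable to time to first audio.
REAL_TIME_BAND_S = 1.5

# Assumption: the two stages run back to back, so their latencies add.
time_to_first_audio = round(T_STT + T_TTFR_PARTIAL, 2)
within_band = time_to_first_audio <= REAL_TIME_BAND_S

print(f"Time to first audio: {time_to_first_audio} s")
print(f"Within {REAL_TIME_BAND_S} s band: {within_band}")
```

Under these assumptions the cascade's first audio arrives well outside the real-time band, which matches the abstract's observation that approaching text accuracy costs real-time interaction.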