Towards Spoken Mathematical Reasoning: Benchmarking Speech-based Models over Multi-faceted Math Problems

Chengwei Wei, Bin Wang, Jung-jae Kim, Nancy F. Chen

TL;DR

This work tackles the challenge of evaluating mathematical reasoning from spoken input by introducing Spoken-MQA, a benchmark spanning arithmetic, contextual, and knowledge-oriented reasoning. It builds a diverse dataset through a pipeline that verbalizes problems with GPT-4o, filters ambiguity with human input, and synthesizes speech via TTS, enabling evaluation of cascade (ASR+LLM) and end-to-end speech LLM architectures. Key findings show cascade models generally outperform end-to-end speech LLMs, especially in arithmetic and knowledge-driven tasks, and reveal a strong bias toward LaTeX-style symbolic representations over verbalized expressions. The study also demonstrates that domain-specific fine-tuning can meaningfully improve performance on spoken math problems, underscoring the importance of speech-focused domain adaptation for robust spoken mathematical reasoning.

Abstract

Recent advances in large language models (LLMs) and multimodal LLMs (MLLMs) have led to strong reasoning ability across a wide range of tasks. However, their ability to perform mathematical reasoning from spoken input remains underexplored. Prior studies on the speech modality have mostly focused on factual speech understanding or simple audio reasoning tasks, providing limited insight into logical step-by-step reasoning, such as that required for mathematical problem solving. To address this gap, we introduce Spoken Math Question Answering (Spoken-MQA), a new benchmark designed to evaluate the mathematical reasoning capabilities of speech-based models, including both cascade models (ASR + LLMs) and end-to-end speech LLMs. Spoken-MQA covers a diverse set of math problems, including pure arithmetic, single-step and multi-step contextual reasoning, and knowledge-oriented reasoning problems, all presented in unambiguous natural spoken language. Through extensive experiments, we find that: (1) while some speech LLMs perform competitively on contextual reasoning tasks involving basic arithmetic, they still struggle with direct arithmetic problems; (2) current LLMs exhibit a strong bias toward symbolic mathematical expressions written in LaTeX and have difficulty interpreting verbalized mathematical expressions; and (3) mathematical knowledge reasoning abilities are significantly degraded in current speech LLMs.
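
To make the cascade setup (ASR + LLM) concrete, the following is a minimal Python sketch of how such a pipeline could be scored by exact-match accuracy. It is not the authors' evaluation code: the names `cascade_answer` and `accuracy`, the representation of audio as raw bytes, and the exact-match criterion are all assumptions made for illustration, and the ASR and LLM components are passed in as plain callables rather than tied to any real library.

```python
from typing import Callable, Iterable, Tuple

# Assumption: audio is represented as raw bytes; a real system would use
# waveforms or file paths. All names here are hypothetical.
Audio = bytes


def cascade_answer(
    audio: Audio,
    asr: Callable[[Audio], str],
    llm: Callable[[str], str],
) -> str:
    """Cascade model: transcribe the spoken question, then answer the text."""
    transcript = asr(audio)
    return llm(transcript)


def accuracy(
    examples: Iterable[Tuple[Audio, str]],
    predict: Callable[[Audio], str],
) -> float:
    """Exact-match accuracy over (audio, gold_answer) pairs.

    Assumption: answers are compared as whitespace-stripped strings.
    """
    examples = list(examples)
    if not examples:
        return 0.0
    correct = sum(
        1 for audio, gold in examples if predict(audio).strip() == gold.strip()
    )
    return correct / len(examples)
```

An end-to-end speech LLM would fit the same harness by passing a single `predict` callable that maps audio directly to an answer, which is what allows the two architectures to be compared on identical spoken questions.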

Paper Structure

This paper contains 15 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of Spoken-MQA.
  • Figure 2: Pipeline for Generating and Filtering Unambiguous Verbal Math Questions. The process involves verbalizing math problems with LaTeX mathematical expressions using GPT-4o, followed by ambiguity detection through both GPT-4o and human verification.
  • Figure 3: Model Accuracy on Short vs. Long Digit Length in Arithmetic.
  • Figure 4: Instruction for generating verbalized math questions and assessing their ambiguity.