VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning

Jingkun Ma, Runzhe Zhan, Yang Li, Di Sun, Hou Pong Chan, Lidia S. Chao, Derek F. Wong

TL;DR

VisAidMath introduces a benchmark and a Three-Layered Funnel Evaluation framework that push the evaluation of visual-aided mathematical reasoning beyond final-answer accuracy. The benchmark comprises 1,200 problems with explicit visual-context and visual-aid generation components, organized into three tasks: General Reasoning (GR), Direct Visual-Aided Reasoning (D-VAR), and Indirect Visual-Aided Reasoning (I-VAR). Experiments across a broad set of LLMs and LMMs reveal a pervasive Reasoning Illusion: models achieve high final-answer accuracy (ACCU) but show substantial declines in Process-Verified Accuracy (PVA) and Solution Process Robustness Score (SPRS), especially on tasks requiring direct visual engagement. The work demonstrates a fundamental gap between visual perception and logical deduction in current multi-modal models, provides an evaluation platform on CodaBench, and highlights directions for developing more reliable visually grounded reasoning systems. Overall, VisAidMath shifts the evaluation emphasis from endpoint accuracy to verifiable reasoning quality and visual-inference capability.

Abstract

A hallmark of advanced artificial intelligence is the capacity to progress from passive visual perception to the strategic modification of visual information to facilitate complex reasoning. This advanced capability, however, remains critically underdeveloped in current Large Multi-modal Models (LMMs). The deficiency is often masked by evaluation metrics that prioritize final-answer accuracy, creating an illusion of competence where genuine reasoning is absent. Using the domain of geometric problem-solving as a precise instrument, we probe this issue through tasks that require constructing visual aids. To this end, we introduce VisAidMath, a challenging benchmark, and our novel Three-Layered Funnel Evaluation Framework. This framework moves beyond simple accuracy (ACCU) to scrutinize the generation of valid visual aids (PVA) and the soundness of subsequent reasoning steps (SPRS). Our extensive experiments on state-of-the-art models, including Doubao-Seed-1.6 and o4, reveal a profound "Reasoning Illusion". We observe that high surface-level accuracy conceals a catastrophic failure in the models' ability to produce valid visual aids or to reason from them. Our findings expose a fundamental schism between visual perception and logical deduction in modern LMMs. We host a public evaluation platform on CodaBench.

Homepage: https://nlp2ct.github.io/VisAidMathHomepage/
Evaluation: https://www.codabench.org/competitions/7634/
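
To make the funnel idea concrete, here is a minimal sketch of how the first two layers could be tallied from per-problem judgments. It is an illustration under stated assumptions, not the paper's scoring code: the `Judged` record, its `process_valid` flag, and the `funnel_summary` function are hypothetical names, PVA is assumed to count only answers that are both correct and process-valid, and the reliability gap is assumed to be the share of correct answers whose process is invalid. SPRS is left out because its exact definition is not given in this summary.

```python
from dataclasses import dataclass


@dataclass
class Judged:
    """One judged model response (hypothetical record layout)."""
    answer_correct: bool  # final answer matches the reference
    process_valid: bool   # assumed judge verdict: visual aid and steps are valid


def funnel_summary(records):
    """Tally ACCU, PVA, and an assumed reliability gap over judged responses."""
    n = len(records)
    if n == 0:
        raise ValueError("no records to score")

    # Layer 1: responses whose final answer is correct.
    correct = [r for r in records if r.answer_correct]
    # Layer 2: correct responses whose process also passes verification.
    verified = [r for r in correct if r.process_valid]

    accu = len(correct) / n
    pva = len(verified) / n
    # Assumed definition: fraction of correct answers with an invalid process.
    gap = (len(correct) - len(verified)) / len(correct) if correct else 0.0
    return {"ACCU": accu, "PVA": pva, "reliability_gap": gap}


if __name__ == "__main__":
    demo = [
        Judged(True, True),
        Judged(True, False),
        Judged(False, False),
        Judged(True, False),
    ]
    print(funnel_summary(demo))
```

In this toy run three of four answers are correct but only one survives process verification, which is the pattern the abstract calls the Reasoning Illusion: ACCU looks strong while PVA is far lower.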

Paper Structure

This paper contains 73 sections, 23 equations, 28 figures, 43 tables.

Figures (28)

  • Figure 1: Comparison between VisAidMath and other benchmarks. Our work focuses on the use of explicit and implicit visual context during the reasoning process.
  • Figure 2: Accuracies of all LMMs on the visual-aided mathematical reasoning task across four branches and six visual aids.
  • Figure 3: Comparison of the tasks: a) General Reasoning: provide MPS reasoning steps directly. b) Direct Visual-Aided Reasoning: create visual aids that disclose the implicit visual context within the problem and combine them with textual reasoning to solve it. c) Indirect Visual-Aided Reasoning: solve the mathematical problem based on given visual aids. Direct visual-aided reasoning requires the model to perform visual reasoning in order to generate the visual aids.
  • Figure 4: Performance degradation from surface accuracy (ACCU) to process-level evaluation. The Reliability Gap (a) measures the proportion of correct answers with procedurally invalid reasoning. The Robustness Gap (b) measures the total drop in solution quality. Both gaps are most pronounced in the Direct Visual-Aided Reasoning (D-VAR) task, highlighting its unique challenge.
  • Figure 5: Qualitative diagnosis of the reasoning gap.
  • ...and 23 more figures