VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs
Can Li, Ying Liu, Ting Zhang, Mei Wang, Hua Huang
TL;DR
VisioMath targets diagram-based multimodal reasoning in mathematics by evaluating how well LMMs ground textual cues in closely similar diagrams. The benchmark comprises $1{,}800$ items with $8{,}070$ diagram options, constructed to mitigate bias and ensure reliability. Across models, accuracy declines as inter-image similarity increases, and image–text misalignment emerges as the dominant failure mode. Three alignment-oriented strategies yield substantial gains: consolidated layouts, explicit visual–text anchors, and multi-image chain-of-thought (CoT) fine-tuning. These results suggest that lightweight alignment interventions can push LMMs toward deeper diagram understanding and grounded cross-modal reasoning.
Abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in integrating vision and language, enabling strong performance across perception, reasoning, and domain-specific tasks. However, their capacity to reason over multiple, visually similar inputs remains insufficiently explored. Such fine-grained comparative reasoning is central to real-world tasks, especially in mathematics and education, where learners must often distinguish between nearly identical diagrams to identify the correct solution. To address this gap, we present VisioMath, a curated benchmark of 1,800 high-quality K-12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities. A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity increases. Our analysis indicates that the dominant failure mode is image-text misalignment: rather than grounding their reasoning in textual cues, models often resort to shallow positional heuristics, resulting in systematic errors. We further explore three alignment-oriented strategies, spanning training-free approaches and fine-tuning, and achieve substantial accuracy gains. We hope that VisioMath will serve as a rigorous benchmark and a catalyst for advancing LMMs toward deeper diagram understanding, precise comparative reasoning, and grounded multi-image-text integration.
