What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?

Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet

TL;DR

MathLens introduces a multi-axial benchmark to diagnose multimodal reasoning in geometry by decomposing it into perception, reasoning, and integration using a formal semantic state $S_k$ and operator $\varphi_k$. The dataset provides aligned diagrams $C^{img}_k$, textual renderings $C^{txt}_k$, perception probes, and robust diagram variants, enabling controlled tests and automatic error analysis. Experimental results reveal that reinforcement learning mainly enhances perception (aided by textual reflective reasoning), while reasoning and especially integration lag behind, with integration remaining the dominant bottleneck. The study shows that robustness to diagram variations and cross-modal grounding depend strongly on training regime, and it offers concrete directions for improving integration, leveraging auxiliary supervision, and expanding atomic perception probes for durable multimodal reasoning. Overall, MathLens provides a reproducible, diagnostics-focused framework that aligns with, and extends, existing geometry benchmarks to illuminate subskill development in multimodal models.

Abstract

Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: perception (extracting information from raw inputs), reasoning (operating on available information), and integration (selecting relevant perceptual evidence and applying it within reasoning). To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects. First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.

Paper Structure

This paper contains 63 sections, 11 equations, 22 figures, 4 tables.

Figures (22)

  • Figure 1: MathLens decomposes multimodal reasoning errors into perception, reasoning, and integration, revealing capacity-specific shifts after fine-tuning that are hidden by aggregate accuracy. Each training strategy affects capacities differently; e.g., text SFT yields a minor gain ($\nearrow$) in reasoning but harms ($\downarrow$) integration (model details in \ref{sec:ax_impl}).
  • Figure 2: Based on MathLens annotations, the joint Multimodal Reasoning Test (1) is decomposed into a Perception Test and a textual Reasoning Test. The Perception Test evaluates questions answerable directly from the diagram, such as reading an annotated angle (e.g., $\angle DFB$ in (2)). The Reasoning Test (3) replaces the diagram with a complete textual description (e.g., “There are triangle $BDF$ … $\angle DFB = 65^\circ$”), such that the question can be solved without visual access. Finally, Integration (4) highlights cases where multimodal reasoning fails even though perception and reasoning, when tested independently, succeed.
  • Figure 3: Sample data generation process in MathLens. From a semantic state representation, we build controlled text descriptions, perception probes, and questions with no overlap with visual content. Also, new diagrams are rendered from the semantic state to avoid visual familiarity effects.
  • Figure 4: Impact of multimodal reasoning training on MathLens performance. We evaluate pretrained backbones alongside models finetuned for multimodal reasoning tasks, reporting accuracy (%). MathLens is sensitive to gains from multimodal reasoning–oriented finetuning.
  • Figure 5: (left) Correlation of MathLens with popular benchmarks. MathLens shows high correlation with standard multimodal reasoning benchmarks. (right) Performance gains by input modality. Bars show percentage point shifts from finetuning for text versus diagram inputs. Visual gains exceed textual ones when models are primed with strong reasoning (textual SFT).
  • ...and 17 more figures
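
The decomposition described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' released code: the class `ProblemOutcome`, its field names, and the function names are hypothetical, and the rule that a failure in both perception and reasoning is counted under perception is an assumption made only to keep the categories mutually exclusive. The one part taken from the caption is the definition of an integration error: the joint multimodal test fails even though the perception and text-only reasoning tests both succeed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemOutcome:
    """Per-problem pass/fail results (hypothetical field names)."""
    joint_correct: bool       # multimodal test: diagram plus question
    perception_correct: bool  # perception probes for this problem all pass
    reasoning_correct: bool   # text-only test using the full textual description


def attribute_error(outcome: ProblemOutcome) -> str:
    """Assign one label to a problem based on its three test results."""
    if outcome.joint_correct:
        return "correct"
    # Assumption: when both subskills fail, blame perception first.
    if not outcome.perception_correct:
        return "perception"
    if not outcome.reasoning_correct:
        return "reasoning"
    # Joint test fails although both subskills succeed in isolation.
    return "integration"


def error_breakdown(outcomes) -> Counter:
    """Count how many problems fall under each label."""
    return Counter(attribute_error(o) for o in outcomes)
```

Calling `error_breakdown` on a list of per-problem outcomes returns a `Counter` keyed by `"correct"`, `"perception"`, `"reasoning"`, and `"integration"`, which is the kind of per-capacity view that a single aggregate accuracy score hides.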