When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye

TL;DR

MIRA is a benchmark for evaluating reasoning that relies on generating intermediate visual representations. It offers 546 multimodal problems across 20 task types, annotated with intermediate visuals, and a three-level evaluation protocol (Direct, Text-CoT, Visual-CoT) that isolates the contribution of visuals to reasoning accuracy. Results show that current multimodal models struggle with direct inputs, that Visual-CoT yields substantial gains (an average relative gain of 33.7%), and that Text-CoT often underperforms for strong models, underscoring the critical role of imagined visuals. The findings highlight a gap between existing closed-source and open-weight models and argue for unified multimodal training that integrates vision and reasoning in a think-while-drawing paradigm, with MIRA providing a reproducible platform for progress.

Abstract

We propose MIRA, a new benchmark designed to evaluate models in scenarios where generating intermediate visual images is essential for successful reasoning. Unlike traditional CoT methods that rely solely on text, tasks in MIRA require models to generate and utilize intermediate images (such as sketches, structural diagrams, or path drawings) to guide their reasoning process. This setup closely mirrors how humans solve complex problems through "drawing to think". To this end, MIRA focuses on tasks that are intrinsically challenging and involve complex structures, spatial relationships, or reasoning steps that are difficult to express through language alone. To ensure that our evaluation data is of high quality, we include 546 multimodal problems, annotated with intermediate visual images and final answers. We also propose a unified evaluation protocol for MIRA that spans three levels of evaluation input: direct input with image and question only, text-only CoT input with image and thinking prompts, and Visual-CoT input with both annotated image clues and textual thinking prompts. To probe the upper bound of model capacity on our benchmark, we also report pass@k and majority voting accuracies under different k settings. Experimental results show that existing multimodal large language models, including the strongest private models as well as strong open-weight models, perform poorly when relying solely on textual prompts. However, when intermediate visual cues are provided, model performance improves consistently, yielding an average relative gain of 33.7% across all models and tasks. We also probe the upper bound by expanding the search space and designing textual prompts aligned with Visual-CoT, but both yield only limited improvements compared to our Visual-CoT setting. These results underscore the critical role of imagined visual information in enabling successful reasoning on MIRA.
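
The pass@k and majority voting metrics mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the widely used unbiased pass@k estimator (the chance that at least one of k responses drawn from n sampled responses is correct); the abstract does not say how MIRA computes pass@k, so this is an assumption rather than the authors' implementation. The function names and the sample answers are hypothetical.

```python
from collections import Counter
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled responses, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k);
    the benchmark's own scripts may compute this differently.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets containing no correct response.
    return 1.0 - comb(n - c, k) / comb(n, k)


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among the sampled responses."""
    return Counter(answers).most_common(1)[0][0]


# Hypothetical example: 8 sampled answers to one multiple-choice problem.
samples = ["B", "A", "B", "C", "B", "A", "B", "D"]
gold = "B"
num_correct = sum(answer == gold for answer in samples)

for k in (1, 2, 4, 8):
    print(f"pass@{k} = {pass_at_k(len(samples), num_correct, k):.3f}")
print("majority vote correct:", majority_vote(samples) == gold)
```

Averaging these per-problem values over the benchmark would give the reported accuracies. Pass@k rewards a model if any of its k attempts is right, while majority voting requires the correct answer to be the most common one, so the two probe different notions of an upper bound.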

Paper Structure

This paper contains 16 sections, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Left: an example from MIRA with responses from both MLLMs and humans, illustrating the visual reasoning and cognitive gaps revealed by our benchmark; Right: while leading MLLMs demonstrate strong performance on established benchmarks, they struggle significantly on MIRA, with none surpassing a 20% accuracy rate with direct inputs. This highlights MIRA's role in exposing the fundamental challenges these models face in complex reasoning tasks that require generating intermediate visual images.
  • Figure 2: MIRA categorizes Visual-CoT reasoning tasks into two primary types: Static (Single-Step) and Dynamic (Multi-Step), with representative examples from each category illustrated in the figure. The dataset includes 20 types of tasks, 546 input images with manually designed questions, and 936 manually constructed single-step and multi-step intermediate images. For more cases, please refer to the dataset appendix.
  • Figure 3: A high-level overview of the MIRA data design and construction pipeline.
  • Figure 4: A comprehensive performance comparison of leading models across three evaluation settings: Direct Evaluation (D), Text-CoT Reasoning (T), and Simulated Visual-CoT Reasoning (V). This stacked bar chart shows performance scaling: the base indicates pass@1 accuracy, with segments above capturing gains from pass@2, pass@4, and pass@8. The red horizontal marks show majority voting scores over 8 responses.
  • Figure 5: A representative failure case of Text-CoT on a Euclidean Geometry (EG) reasoning task. Even the strongest model (GPT-5) struggles to correctly reason through the problem using plain text, due to its inability to manipulate intermediate visual states. In contrast, the Visual-CoT approach, which leverages intermediate visualizations, enables more accurate localization of the overlapping region and correct counting of red points.
  • ...and 10 more figures
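
The stacked bars described in the Figure 4 caption can be read as incremental gains: the base is pass@1, and each segment above it is the extra accuracy obtained by moving to the next larger k. The following short Python sketch shows that decomposition under this reading; the function name and the accuracy values are hypothetical placeholders, not results from the paper.

```python
def stacked_segments(pass_at: dict[int, float]) -> list[tuple[int, float]]:
    """Split cumulative pass@k accuracies into stacked-bar segment heights.

    The first segment is the smallest-k accuracy itself; each later segment
    is the gain over the previous k.
    """
    segments = []
    previous = 0.0
    for k in sorted(pass_at):
        segments.append((k, pass_at[k] - previous))
        previous = pass_at[k]
    return segments


# Hypothetical accuracies for one model in one setting (not from the paper).
example = {1: 0.25, 2: 0.5, 4: 0.5, 8: 0.75}
for k, height in stacked_segments(example):
    print(f"pass@{k} segment height: {height:.2f}")
```

Because pass@k is non-decreasing in k, every segment height is non-negative, which is what makes the stacked presentation well defined.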