RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs

Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, Kiyohiro Nakayama, Zhengyang Geng, Houwen Peng, Han Hu, Shi-Min Hu

TL;DR

RBench-V addresses the gap in evaluating visual reasoning models on their ability to generate multi-modal outputs during problem solving. It introduces 803 questions across math, physics, counting, and games that specifically require image creation or modification as part of reasoning. Empirical results show that even the strongest current model, o3, achieves only 25.8% accuracy versus 82.3% for humans, revealing a substantial gap in multi-modal reasoning capabilities. The benchmark provides an automated framework to track progress toward more capable omni-modal reasoning, highlighting the need for M-CoT and agent-based approaches to overcome current limitations.

Abstract

The rapid advancement of native multi-modal models and omni-models, exemplified by GPT-4o, Gemini, and o3, with their capability to process and generate content across modalities such as text and images, marks a significant milestone in the evolution of intelligence. Systematic evaluation of their multi-modal output capabilities in visual thinking processes (also known as multi-modal chain of thought, M-CoT) becomes critically important. However, existing benchmarks for evaluating multi-modal models primarily focus on assessing multi-modal inputs and text-only reasoning while neglecting the importance of reasoning through multi-modal outputs. In this paper, we present a benchmark, dubbed RBench-V, designed to assess models' vision-indispensable reasoning abilities. To construct RBench-V, we carefully hand-pick 803 questions covering math, physics, counting, and games. Unlike previous benchmarks that typically specify certain input modalities, RBench-V presents problems centered on multi-modal outputs, which require image manipulation such as generating novel images and constructing auxiliary lines to support the reasoning process. We evaluate numerous open- and closed-source models on RBench-V, including o3, Gemini 2.5 Pro, Qwen2.5-VL, etc. Even the best-performing model, o3, achieves only 25.8% accuracy on RBench-V, far below the human score of 82.3%, highlighting that current models struggle to leverage multi-modal reasoning. Data and code are available at https://evalmodels.github.io/rbenchv

Paper Structure

This paper contains 19 sections, 3 figures, 3 tables.

Figures (3)

  • Figure 2: The motivation of $\mathbf{\mathcal{R}}$Bench-V. Left: An illustration showing both humans and the GPT-4o model being asked a game-related question from $\mathbf{\mathcal{R}}$Bench-V. Right: Common benchmarks such as MMLU, MMMU, and $\mathbf{\mathcal{R}}$Bench focus on multi-modal inputs and textual outputs, whereas $\mathbf{\mathcal{R}}$Bench-V emphasizes not only multi-modal inputs but also multi-modal outputs.
  • Figure 3: A visual comparison with MMLU, MMMU and $\mathbf{\mathcal{R}}$Bench-V. It shows that solving problems in MMLU and MMMU mainly requires understanding multi-modal inputs and generating textual outputs, whereas solving problems in $\mathbf{\mathcal{R}}$Bench-V demands not only understanding multi-modal inputs but also generating multi-modal outputs. The red lines shown in the figure are not part of the original questions and represent the multi-modal reasoning process when solving problems in $\mathbf{\mathcal{R}}$Bench-V, such as drawing geometric shapes or tracing paths through a maze.
  • Figure 4: Examples of o3's responses to math and game questions in $\mathbf{\mathcal{R}}$Bench-V. Left: o3 correctly answers a math question in $\mathbf{\mathcal{R}}$Bench-V by transforming the geometry problem into an algebraic one using a coordinate system, whereas humans typically solve it using geometric methods. Right: o3 fails to answer a game question correctly. The blue highlights indicate the cause of the error and the key issue is that the model fails to follow the instructions to draw the required connections.