VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

Jing Bi, Junjia Guo, Susan Liang, Guangyu Sun, Luchuan Song, Yunlong Tang, Jinxi He, Jiarui Wu, Ali Vosoughi, Chen Chen, Chenliang Xu

TL;DR

VERIFY targets the gap between perceptual accuracy and true visual reasoning by curating hard, real-world visual puzzles with ground-truth, human-annotated reasoning paths. It introduces an evaluation framework that decouples perception from reasoning stages (Recognition, Abstraction, Deduction) and adds perception-based metrics to measure fidelity beyond final answers. Across open-source and proprietary MLLMs, results reveal persistent deficits in visual reasoning fidelity, with accuracy around 21.7% and notable biases in perception vs. reasoning, calling for more balanced, cognitive-aware model development. The benchmark, including the data, reasoning trajectories, and evaluation protocols, provides a scalable, interpretable tool for diagnosing and guiding improvements in multimodal reasoning systems.

Abstract

Visual reasoning is central to human cognition, enabling individuals to interpret and abstractly understand their environment. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and inadequately assess true visual reasoning capabilities. To bridge this critical gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making VERIFY the first benchmark to provide an in-depth evaluation of model decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For more teaser examples and to test the benchmark, visit our project page (https://verify-eqh.pages.dev/).

Paper Structure

This paper contains 22 sections, 1 equation, 23 figures, 4 tables.

Figures (23)

  • Figure 1: This example demonstrates that current MLLMs primarily depend on straightforward visual signals (e.g., letters) for reasoning, frequently neglecting patterns based on other characteristics, such as shapes or line properties. VERIFY delivers human-annotated reasoning paths to enhance the evaluation and comprehension of why and when models fail.
  • Figure 1: Incorrect reasoning paths still lead to the correct answer
  • Figure 2: Categories from the VERIFY dataset cover a range of patterns, from logical operations to 3D geometry and mathematics. The right panel presents a human reasoning path, demonstrating how visual transformations, rotations, and inside-outside shifts lead to the final answer. We encourage readers to test these examples with MLLMs (e.g., o1 or Gemini) to assess their reasoning capabilities.
  • Figure 2: Human reasoning favors option A because of a decreasing dot pattern. o1 selects an incorrect answer because it miscounts and fails to verify its result against the options, while Qwen2.5 accidentally selects the correct answer despite a miscount.
  • Figure 3: We divide the reasoning process into four key stages inspired by human visual reasoning: perception, recognition, abstraction, and deduction. Unlike general visual tasks, where perception involves detecting raw visual features, humans often have implicit perception because the provided visual elements are already structured for direct recognition of useful components. Even for the complex problems shown, a model with strong visual abilities, such as Gemini, can effectively analyze patterns and logical structures to determine the correct answer.
  • ...and 18 more figures
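
The figure captions above note that an incorrect reasoning path can still lead to the correct answer, which is why final-answer accuracy alone can overstate reasoning ability. The sketch below illustrates that distinction only. It is not the paper's metric: the `Response` record, the `path_correct` flag (assumed to come from comparing a model's reasoning with the human-annotated path), and the names `faithful_accuracy` and `lucky_rate` are all hypothetical, and the data is a toy example rather than a reported result.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """One model response to one puzzle (hypothetical record layout)."""
    model: str
    answer_correct: bool  # final option matches the ground-truth answer
    path_correct: bool    # reasoning agrees with the human-annotated path (assumed judgment)


def summarize(responses):
    """Separate plain accuracy from accuracy that is backed by a correct reasoning path."""
    if not responses:
        raise ValueError("need at least one response")
    n = len(responses)
    accuracy = sum(r.answer_correct for r in responses) / n
    # Correct answer AND correct reasoning path.
    faithful = sum(r.answer_correct and r.path_correct for r in responses) / n
    # Correct answer reached through a wrong reasoning path.
    lucky = sum(r.answer_correct and not r.path_correct for r in responses) / n
    return {"accuracy": accuracy, "faithful_accuracy": faithful, "lucky_rate": lucky}


if __name__ == "__main__":
    # Toy data only: model_a miscounts and answers wrongly, model_b miscounts
    # but lands on the right option, model_c reasons and answers correctly.
    toy = [
        Response("model_a", answer_correct=False, path_correct=False),
        Response("model_b", answer_correct=True, path_correct=False),
        Response("model_c", answer_correct=True, path_correct=True),
        Response("model_d", answer_correct=False, path_correct=True),
    ]
    print(summarize(toy))
```

In this toy set, half of the answers are correct but only half of those are supported by a correct path; a gap of that kind is what a fidelity-oriented evaluation is meant to expose.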