VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

Lingxiao Li, Yifan Wang, Xinyan Gao, Chen Tang, Xiangyu Yue, Chenyu You

TL;DR

VisReason introduces a large-scale, depth-aware visual CoT dataset (489K examples) and an expert subset (VisReason-Pro, 165K) for training multimodal LLMs in spatially grounded, multi-round reasoning. The authors fine-tune a Qwen2.5-VL-7B backbone with LoRA on VisReason and VisReason-Pro, achieving state-of-the-art performance on the Visual-CoT benchmark and strong generalization to external evaluation suites; they attribute these gains to depth cues and zoom–verify supervision. A dedicated data-generation pipeline provides scene-level rationales, RoIs, and 3D grounding signals that promote global-to-local reasoning and reduce shortcut learning. The work also provides a benchmark, evaluation protocols, and prompts, positioning VisReason as a foundational resource for advancing human-like visual reasoning in multimodal systems.

Abstract

Chain-of-Thought (CoT) prompting has proven remarkably effective for eliciting complex reasoning in large language models (LLMs). Yet, its potential in multimodal large language models (MLLMs) remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. Existing visual-CoT resources are typically small, domain-specific, or lack the human-like stepwise structure necessary for compositional visual reasoning. In this paper, we introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning. VisReason comprises 489K annotated examples spanning four diverse domains, each featuring multi-round, human-like rationales that guide MLLMs through interpretable visual reasoning steps. Building upon this, we curate VisReason-Pro, a 165K subset produced with a stronger expert-level GPT annotator, enriched with detailed reasoning traces and 3D spatial grounding via depth-informed annotations. Fine-tuning the state-of-the-art Qwen2.5-VL model on VisReason and VisReason-Pro yields substantial improvements in step-by-step visual reasoning accuracy, interpretability, and cross-benchmark generalization. These results demonstrate that VisReason equips MLLMs with more systematic and generalizable reasoning capabilities. We envision VisReason as a cornerstone for cultivating human-like visual reasoning, paving the way toward the next generation of multimodal intelligence.

Paper Structure

This paper contains 35 sections, 7 equations, 22 figures, 9 tables, 2 algorithms.

Figures (22)

  • Figure 1: An MLLM fine-tuned on VisReason emulates a human-like visual reasoning process to solve a complex query. Rather than processing the entire image uniformly, the model adopts a dynamic global-to-local workflow: it first assesses the overall scene, then progressively focuses on salient regions to collect fine-grained visual evidence. This multi-step, spatially grounded visual Chain-of-Thought allows the model to anchor its reasoning in concrete visual cues, enabling accurate solutions to complex spatial problems that challenge conventional approaches. (Zoom in for better visibility.)
  • Figure 2: For each image-question pair, we provide a region of interest (bounding box) and a compact multi-round visual chain-of-thought: each round offers a scene sketch, an optional zoom to a predicted RoI, and a brief rationale. When available, depth cues indicate ordinal ordering. The annotations are concise and process-oriented, enabling spatially grounded reasoning on fine details and complex relations.
  • Figure 3: Pipeline for VisReason and VisReason-Pro data generation and supervision. Given an input image, we derive semantic segments and monocular depth to form an object list with categories, bounding boxes, and ordinal depth; a generator then produces a 3D-aware QA pair and target box. A second stage emits a compact, multi-round visual CoT -- scene sketch, predicted RoI, and rationale -- while iteratively zooming and verifying (with RoI/answer fix) until the final answer and finalized annotations are obtained.
  • Figure 4: Statistics of the proposed VisReason dataset. We report the distribution of CoT rounds (1–4), the average bounding-box size, and the average response length per round for each source dataset, showing that VisReason offers rich multi-round supervision and consistently long, detailed reasoning steps across diverse domains.
  • Figure 5: Overview of the VisReason paradigm. The model iteratively processes the query by first generating a textual rationale and a bounding box for the next region of interest. It then crops the original image to this region, extracts new visual features, and appends them to the context to inform the next reasoning step, creating a zoom-and-verify sequence.
  • ...and 17 more figures
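
The zoom-and-verify loop described in the Figure 5 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Step`, `crop`, `zoom_and_verify`, and `propose_step` are hypothetical, the image is a toy grid of integers rather than real pixels, and the raw crop is appended to the context where the paper's model would append extracted visual features. The cap of four rounds is an assumption borrowed from the 1–4 round range reported in the Figure 4 caption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical types for illustration only (not from the paper).
Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1) in original-image coordinates, x1/y1 exclusive
Image = List[List[int]]           # toy stand-in for an image: rows of pixel values
Context = List[Tuple[str, object]]


@dataclass
class Step:
    rationale: str                # brief textual rationale for this round
    roi: Optional[Box] = None     # next region of interest; None means "no further zoom"
    answer: Optional[str] = None  # set once the model commits to a final answer


def crop(image: Image, box: Box) -> Image:
    """Cut the region `box` out of the ORIGINAL image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def zoom_and_verify(
    image: Image,
    question: str,
    propose_step: Callable[[Context], Step],  # stand-in for the fine-tuned MLLM
    max_rounds: int = 4,                      # assumed cap, see lead-in
) -> List[Step]:
    """Run rounds of: rationale + RoI -> crop -> append crop to context."""
    context: Context = [("question", question), ("image", image)]
    trace: List[Step] = []
    for _ in range(max_rounds):
        step = propose_step(context)
        trace.append(step)
        context.append(("rationale", step.rationale))
        if step.answer is not None or step.roi is None:
            break  # the model answered or declined to zoom further
        # Always crop from the original image, then feed the crop back in.
        context.append(("image", crop(image, step.roi)))
    return trace


def toy_policy(context: Context) -> Step:
    """Toy policy: zoom once, then answer from the zoomed crop."""
    n_images = sum(1 for kind, _ in context if kind == "image")
    if n_images == 1:
        return Step("The bright object sits in the right-hand middle rows.", roi=(2, 1, 4, 3))
    zoomed = context[-1][1]
    brightest = max(max(row) for row in zoomed)
    return Step("The zoomed region confirms the value.", answer=str(brightest))


image = [
    [0, 0, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 3, 9],
    [0, 0, 0, 0],
]
trace = zoom_and_verify(image, "What is the brightest value?", toy_policy)
```

With the toy policy, the first round proposes a region and the second round answers from the crop, so `trace` holds two steps. Cropping from the original image at every round, rather than from the previous crop, follows the caption's wording ("crops the original image to this region"); whether later boxes are expressed in original or crop coordinates in the actual system is not stated here.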