VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning

Lingxiao Li; Yifan Wang; Xinyan Gao; Chen Tang; Xiangyu Yue; Chenyu You

TL;DR

VisReason introduces a large-scale, depth-aware visual chain-of-thought (CoT) dataset (489K examples) and an expert subset (VisReason-Pro, 165K) for training multimodal LLMs in spatially grounded, multi-round reasoning. The authors fine-tune a Qwen2.5-VL-7B backbone with LoRA on VisReason and VisReason-Pro, reporting state-of-the-art performance on the Visual-CoT benchmark and strong generalization to external evaluation suites; they attribute these gains to depth cues and zoom–verify supervision. A dedicated data-generation pipeline provides scene-level rationales, regions of interest (RoIs), and 3D grounding signals that promote global-to-local reasoning and reduce shortcut learning. The work also provides a benchmark, evaluation protocols, and prompts, positioning VisReason as a foundational resource for advancing human-like visual reasoning in multimodal systems.

Abstract

Chain-of-Thought (CoT) prompting has proven remarkably effective for eliciting complex reasoning in large language models (LLMs). Yet, its potential in multimodal large language models (MLLMs) remains largely untapped, hindered by the absence of large-scale datasets that capture the rich, spatially grounded reasoning intrinsic to visual understanding. Existing visual-CoT resources are typically small, domain-specific, or lack the human-like stepwise structure necessary for compositional visual reasoning. In this paper, we introduce VisReason, a large-scale dataset designed to advance visual Chain-of-Thought reasoning. VisReason comprises 489K annotated examples spanning four diverse domains, each featuring multi-round, human-like rationales that guide MLLMs through interpretable visual reasoning steps. Building upon this, we curate VisReason-Pro, a 165K subset produced with a stronger expert-level GPT annotator, enriched with detailed reasoning traces and 3D spatial grounding via depth-informed annotations. Fine-tuning the state-of-the-art Qwen2.5-VL model on VisReason and VisReason-Pro yields substantial improvements in step-by-step visual reasoning accuracy, interpretability, and cross-benchmark generalization. These results demonstrate that VisReason equips MLLMs with more systematic and generalizable reasoning capabilities. We envision VisReason as a cornerstone for cultivating human-like visual reasoning, paving the way toward the next generation of multimodal intelligence.
