Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark

Haobo Yuan, Yueyi Sun, Yanwei Li, Tao Zhang, Xueqing Deng, Henghui Ding, Lu Qi, Anran Wang, Xiangtai Li, Ming-Hsuan Yang

TL;DR

Visual Reasoning Tracer tackles opacity in multimodal reasoning by requiring step-by-step traces grounded in pixel-level segmentation masks. The authors release VRT-Bench and VRT-80k, plus Logical Quality (LQ) and Visual Quality (VQ) metrics to enable evaluation and training of interpretable visual reasoning. Empirical results show existing models struggle to ground intermediate steps, while the proposed R-Sa2VA approach, trained on VRT-80k, can produce faithful traces and improve final answers. The work lays a foundation for verifiable visual reasoning in multimodal models and points to future extensions, including video reasoning and self-correction capabilities.

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.

Paper Structure

This paper contains 27 sections, 7 equations, 27 figures, 8 tables.

Figures (27)

  • Figure 1: Visual Reasoning Tracer (VRT). Given an image and a question, we ask an MLLM to generate a step-by-step reasoning process. Each step is grounded by its corresponding visual tracer, making the model’s decision-making process transparent and easy to understand. We use masks to ground each object in the scene. See more demos in the supplementary material.
  • Figure 2: The training data pipeline. Our two-stage pipeline first generates object-caption pairs by segmenting an image with the Segment Anything Model (SAM) and RAM++/APE and describing each mask with the Describe Anything Model (DAM). In the second stage, these grounded captions are formatted into a prompt to guide Gemini in generating complex question-reasoning-answer data.
  • Figure 3: Overview of the proposed R-Sa2VA framework. (a) Inference Pipeline: Given a user query, the MLLM engages in an explicit visual reasoning process. It generates a "Thinking" sequence to identify and segment objects (e.g., distinguishing the "smaller pancake") before producing the final "Answering" sequence. A Mask Decoder translates the embedded [SEG] tokens into pixel-level masks, visually grounding the reasoning trace. (b) Reinforcement Fine-Tuning: To align the model's reasoning with accurate grounding, we employ a sampling-based optimization strategy. The model is trained using a dual-reward mechanism: a Language Reward that enforces structural consistency (e.g., correct usage of <think> tags) and a Matching-based IoU Reward that evaluates the quality of the generated masks.
  • Figure 4: Visualization on RefCOCO dataset.
  • Figure A1: RL training dynamics on VRT-Bench. Evolution of reasoning/answer Logic Quality (LQ), Visual Quality (VQ), and mean IoU (mIoU) during reinforcement learning of R-Sa2VA-Qwen3VL-4B-RL. The first two rows report per-category scores for Functionality, Visual Features, Location, and Comparison, while the bottom row summarizes overall reasoning and answer metrics and compares them to the SFT-only baseline (orange dashed lines).
  • ...and 22 more figures
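
The dual-reward mechanism described in the Figure 3(b) caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the set-of-pixels mask representation, the exact format check, the brute-force matching, and the equal reward weights are all assumptions made for illustration. The caption only states that one reward checks structure (such as <think> tag usage) and the other scores predicted masks by IoU after matching them to reference masks.

```python
import re
from itertools import permutations

# Hypothetical format check: exactly one <think>...</think> block followed by
# non-empty answer text. The real Language Reward may check more than this.
THINK_RE = re.compile(r"\s*<think>.+?</think>\s*\S.*", re.DOTALL)


def language_reward(text):
    """Return 1.0 if the output has the assumed think-then-answer structure."""
    well_formed = (
        text.count("<think>") == 1
        and text.count("</think>") == 1
        and THINK_RE.fullmatch(text) is not None
    )
    return 1.0 if well_formed else 0.0


def mask_iou(a, b):
    """IoU of two masks, each given as a set of (row, col) pixel coordinates."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def matching_iou_reward(pred_masks, gt_masks):
    """Mean IoU under the best one-to-one matching of predicted to reference masks.

    Dividing by the larger of the two counts penalizes missing or extra masks
    (an assumed normalization). Brute force is only practical for a handful
    of masks per sample.
    """
    if not pred_masks or not gt_masks:
        return 0.0
    small, large = sorted((pred_masks, gt_masks), key=len)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(mask_iou(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    return best / len(large)


def total_reward(text, pred_masks, gt_masks, w_lang=0.5, w_iou=0.5):
    """Weighted sum of the two rewards; the weights are hypothetical."""
    return w_lang * language_reward(text) + w_iou * matching_iou_reward(pred_masks, gt_masks)
```

Matching before scoring means the reward does not depend on the order in which the model emits its masks. For samples with many objects, an assignment solver such as scipy.optimize.linear_sum_assignment would be a more practical substitute for the permutation search.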