VGR: Visual Grounded Reasoning

Jiacong Wang, Zijian Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao

TL;DR

This work tackles language bias in multimodal reasoning by introducing Visual Grounded Reasoning (VGR), which grounds inference in targeted image regions via a selective visual replay mechanism. It couples a memory pool of high-resolution visual tokens with bounding-box replay signals, so that the model can pull in visual evidence on demand during reasoning. A new dataset, VGR-SFT, is built through a three-stage data pipeline (cold-start, rejection sampling, annotation model) to explicitly model region attention in multimodal reasoning. Experiments show that VGR outperforms its baseline on several benchmarks while using only a fraction of the image tokens, indicating improved efficiency and interpretability for multimodal large language models.

Abstract

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly reason in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in the language space, VGR first detects relevant regions that may help to solve the problem, and then provides precise answers based on the replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data with mixed vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference, and a replay stage integrates the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while improving scores by +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
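
The replay step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `<replay>[x1, y1, x2, y2]</replay>` signal syntax, the normalized box coordinates, the grid-shaped memory pool, and the function names `find_replay_boxes` and `replay_tokens` are all assumptions made for illustration. The sketch only shows the general idea of detecting a bounding-box signal in the generated reasoning and retrieving the matching visual tokens from a pool.

```python
import math
import re
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]

# Hypothetical replay-signal syntax; the source does not specify the real format.
_NUM = r"(\d*\.?\d+)"
REPLAY_PATTERN = re.compile(
    r"<replay>\[" + r",\s*".join([_NUM] * 4) + r"\]</replay>"
)


def find_replay_boxes(text: str) -> List[Box]:
    """Return every (x1, y1, x2, y2) box signalled in the generated text."""
    boxes = []
    for match in REPLAY_PATTERN.finditer(text):
        x1, y1, x2, y2 = (float(v) for v in match.groups())
        boxes.append((x1, y1, x2, y2))
    return boxes


def replay_tokens(memory_pool: Sequence[Sequence[int]], box: Box) -> List[int]:
    """Collect the tokens whose grid cells overlap a normalized box.

    memory_pool is assumed to be a rows x cols grid of visual tokens
    (plain integers here, standing in for feature vectors).
    """
    rows, cols = len(memory_pool), len(memory_pool[0])
    x1, y1, x2, y2 = box
    # Clamp the start cell into the grid and keep at least one cell per axis.
    c0 = min(int(x1 * cols), cols - 1)
    r0 = min(int(y1 * rows), rows - 1)
    c1 = min(max(c0 + 1, math.ceil(x2 * cols)), cols)
    r1 = min(max(r0 + 1, math.ceil(y2 * rows)), rows)
    return [memory_pool[r][c] for r in range(r0, r1) for c in range(c0, c1)]


if __name__ == "__main__":
    # A toy 4 x 4 pool where each token id encodes its grid position.
    pool = [[r * 4 + c for c in range(4)] for r in range(4)]
    reasoning = "Check the legend <replay>[0.5, 0.5, 1.0, 1.0]</replay> first."
    for box in find_replay_boxes(reasoning):
        print(box, replay_tokens(pool, box))
```

In a real model the retrieved tokens would be feature vectors appended to the reasoning context rather than integers, and only the selected regions would be replayed instead of the full high-resolution token set.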

Paper Structure

This paper contains 26 sections, 7 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Overview framework of our method. On the left, we crop the original image with the AnyRes strategy to maintain a memory pool of visual details; when a replay signal is detected, VGR retrieves the image tokens from the memory pool to enrich the visual clues used in reasoning. On the right, we show an example of VGR in action: VGR enables the MLLM to check the key area on demand.
  • Figure 2: Overview framework of our data pipeline. The blue arrow line indicates the cold-start data curation pipeline for the annotator and the green line indicates the data pipeline for training data.
  • Figure 3: Example of training data in VGR-SFT.
  • Figure 4: Example generated by our annotation model. We distill core information and the chain-of-thought from long redundant reasoning with reject sampling and rewriting.
  • Figure 5: Example of training data in VGR-SFT in different formulations.
  • ...and 2 more figures