Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

Meng Cao, Haoze Zhao, Can Zhang, Xiaojun Chang, Ian Reid, Xiaodan Liang

TL;DR

This work tackles unreliable and opaque reasoning in large vision-language models by introducing Ground-R1, an annotation-free grounded visual reasoning framework. It splits reasoning into a grounding phase that generates evidence-region rollouts under strict format constraints and an answering phase that produces final responses guided by format and accuracy rewards, trained with GRPO. Empirical results show Ground-R1 surpassing supervised fine-tuning and prior grounded approaches on VisCoT and LVLM benchmarks, and it exhibits emergent cognitive behaviors like uncertainty awareness and iterative refinement. The approach offers a scalable, interpretable alternative to bounding-box–dependent grounding for multi-modal reasoning tasks.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive general capabilities across a wide range of multi-modal tasks. However, the reasoning processes of LVLMs often suffer from unreliable outputs and limited interpretability. To address this, grounded visual reasoning has emerged as a promising paradigm that anchors responses in salient visual evidence regions. Existing approaches, however, typically rely on costly supervision such as bounding box annotations, chain-of-thought rationales, or external tool calls, which limits their scalability. In this work, we propose Ground-R1, a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 consists of a grounding phase that generates evidence region rollouts based on format constraints, and an answering phase that produces responses guided by both answer correctness and format adherence rewards. Extensive experiments across multiple visual reasoning benchmarks demonstrate that Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement, offering a scalable and interpretable alternative to existing approaches.
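
The two reward signals described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical output template in which the grounding phase emits a single `<box>[x1, y1, x2, y2]</box>` span and the answering phase wraps its response in `<answer>...</answer>`. The tag names, the exact-match accuracy check, and the equal weighting of the two rewards are all illustrative assumptions, not details taken from the paper.

```python
import re

# Hypothetical output template: the source does not specify the actual tags.
BOX_PATTERN = re.compile(
    r"<box>\s*\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\s*</box>"
)
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def grounding_format_reward(text: str, width: int, height: int) -> float:
    """Return 1.0 if the rollout holds exactly one well-formed, in-bounds box.

    No ground-truth box is consulted: the reward checks format only.
    """
    matches = BOX_PATTERN.findall(text)
    if len(matches) != 1:
        return 0.0
    x1, y1, x2, y2 = map(int, matches[0])
    in_bounds = 0 <= x1 < x2 <= width and 0 <= y1 < y2 <= height
    return 1.0 if in_bounds else 0.0


def answer_reward(text: str, reference: str) -> float:
    """Sum of a format reward and an accuracy reward (assumed equal weights)."""
    match = ANSWER_PATTERN.search(text)
    if match is None:
        return 0.0  # malformed output earns neither reward
    format_reward = 1.0
    predicted = match.group(1).strip().lower()
    accuracy_reward = 1.0 if predicted == reference.strip().lower() else 0.0
    return format_reward + accuracy_reward
```

Because the grounding reward never compares against an annotated box, a sketch like this needs only question-answer pairs as supervision, which is the property the abstract emphasizes.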

Paper Structure

This paper contains 15 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Reasoning trajectories of Ground-R1. Our method enables visual reasoning through automatic zoom-in operations while eliminating the need for explicit bounding box supervision, rationale annotations, or external tool calls. Ground-R1 exhibits human-like cognitive capabilities including uncertainty awareness, spatial perception, and iterative refinement. Refer to Appendix \ref{sec:A2} for more visualization cases.
  • Figure 2: Schematic illustration of Ground-R1. The grounding phase analyzes input instructions and generates evidence region rollouts, which are supervised by the format reward. $\boldsymbol{b}_i \in \mathbb{R}^{4}$ denotes the axis-aligned bounding box coordinates and $\boldsymbol{e}_i$ is the corresponding cropped evidence region, $i \in [1, G_1]$. The answering phase takes the input image, the question, and the generated evidence regions as input and delivers final answers. This procedure is driven by both the format and accuracy rewards. $\boldsymbol{o}_{i,j}$ denotes the $j$-th rollout answer based on the $i$-th evidence region $\boldsymbol{e}_{i}$, $j \in [1, G_2]$. $A_{i,j}$ is the computed advantage (cf. Eq. \ref{eq:4}).
  • Figure 3: Visualizations of training curves of Ground-R1.
  • Figure 4: Reasoning trajectories of Ground-R1.
  • Figure 5: Reasoning trajectories of Ground-R1.
  • ...and 3 more figures
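
The Figure 2 caption describes $G_1$ evidence regions, $G_2$ answer rollouts per region, and an advantage $A_{i,j}$ for each rollout. The paper's Eq. 4 is not reproduced on this page, so the sketch below substitutes the standard GRPO-style normalization, in which each reward is centered and scaled by the mean and standard deviation of its group. Pooling all $G_1 \times G_2$ rewards for a question into one group is an assumption made for illustration, and the function name and `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a G1 x G2 grid of rewards into group-relative advantages.

    rewards[i][j] is the scalar reward of the j-th answer rollout produced
    from the i-th evidence region. All rollouts for the same question are
    pooled into a single group (an assumption; the paper may group differently).
    """
    flat = [r for row in rewards for r in row]
    mu = mean(flat)
    sigma = pstdev(flat)  # population standard deviation over the group
    # eps keeps the division finite when every rollout earns the same reward
    return [[(r - mu) / (sigma + eps) for r in row] for row in rewards]


# Example: G1 = 2 evidence regions, G2 = 2 answers per region.
example = group_relative_advantages([[2.0, 0.0], [1.0, 1.0]])
```

Rollouts that score above the group mean receive positive advantages and those below receive negative ones, so no learned value function is needed to decide which rollouts to reinforce.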