Ground-R1: Incentivizing Grounded Visual Reasoning via Reinforcement Learning

Meng Cao; Haoze Zhao; Can Zhang; Xiaojun Chang; Ian Reid; Xiaodan Liang

TL;DR

This work tackles unreliable and opaque reasoning in large vision-language models by introducing Ground-R1, an annotation-free grounded visual reasoning framework. It splits reasoning into a grounding phase, which generates evidence-region rollouts under strict format constraints, and an answering phase, which produces final responses guided by format and accuracy rewards; the model is trained with GRPO. Empirical results show Ground-R1 surpassing supervised fine-tuning and prior grounded approaches on VisCoT and LVLM benchmarks, and it exhibits emergent cognitive behaviors such as uncertainty awareness and iterative refinement. The approach offers a scalable, interpretable alternative to grounding methods that depend on bounding-box annotations for multi-modal reasoning tasks.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive general capabilities across a wide range of multi-modal tasks. However, their reasoning processes often suffer from unreliable outputs and limited interpretability. To address this, grounded visual reasoning has emerged as a promising paradigm that anchors responses on salient visual evidence regions. Existing approaches, however, typically rely on costly supervision such as bounding box annotations, chain-of-thought rationales, or external tool calls, which limits their scalability. In this work, we propose Ground-R1, a reinforcement learning framework that enables grounded visual reasoning without requiring explicit evidence or rationale annotations. Ground-R1 consists of a grounding phase that generates evidence region rollouts based on format constraints, and an answering phase that produces responses guided by both answer correctness and format adherence rewards. Extensive experiments across multiple visual reasoning benchmarks demonstrate that Ground-R1 achieves superior performance and exhibits emergent cognitive behaviors such as uncertainty awareness, spatial perception, and iterative refinement, offering a scalable and interpretable alternative to existing approaches.
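
To make the reward structure described above more concrete, the following is a minimal sketch of how a format reward and an accuracy reward could be combined and turned into group-relative advantages in the style of GRPO. It is not the authors' implementation. The `<think>`, `<box>`, and `<answer>` tags, the exact-match scoring, the unweighted sum of the two rewards, and all function names are assumptions made for illustration; this page does not specify the actual output template or reward weights.

```python
import re
from statistics import mean, pstdev

# Hypothetical rollout template (an assumption, not taken from the paper):
#   <think>...</think><box>[x1, y1, x2, y2]</box><answer>...</answer>
ROLLOUT_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*"
    r"<box>\s*\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]\s*</box>\s*"
    r"<answer>(.*?)</answer>\s*$",
    re.DOTALL,
)


def format_reward(rollout: str) -> float:
    """Return 1.0 if the rollout follows the assumed template, else 0.0."""
    return 1.0 if ROLLOUT_PATTERN.match(rollout) else 0.0


def accuracy_reward(rollout: str, gold: str) -> float:
    """Return 1.0 on a case-insensitive exact match of the answer, else 0.0."""
    found = re.search(r"<answer>(.*?)</answer>", rollout, re.DOTALL)
    if found is None:
        return 0.0
    return 1.0 if found.group(1).strip().lower() == gold.strip().lower() else 0.0


def total_reward(rollout: str, gold: str) -> float:
    """Unweighted sum of both rewards (the weighting is an assumption)."""
    return format_reward(rollout) + accuracy_reward(rollout, gold)


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts sampled for the same question.
rollouts = [
    "<think>The sign is small.</think><box>[10, 20, 110, 80]</box><answer>Stop</answer>",
    "<think>The sign is small.</think><box>[10, 20, 110, 80]</box><answer>Yield</answer>",
    "Stop",  # violates the template, so it earns no reward
]
rewards = [total_reward(r, "stop") for r in rollouts]
advantages = group_advantages(rewards)
```

Note that no ground-truth box appears anywhere in the sketch: the evidence region is only checked for being well-formed, which is consistent with the annotation-free setting the abstract describes. Rollouts that are both well-formed and correct receive a positive advantage relative to the rest of their group.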
