GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation

Sandesh Hegde; Jaison Saji Chacko; Debarshi Banerjee; Uma Mahesh

TL;DR

GenSeg-R1 is a reason-then-segment framework for fine-grained referring segmentation. It couples a finetuned Qwen3-VL with a frozen SAM 2 segmenter and trains the VLM with GRPO. Two reward variants are used: a fast, distance-based reward (GenSeg-R1-4B/8B) and a SAM 2-in-the-loop reward that directly optimizes segmentation quality and no-target handling. GenSeg-R1-G is trained on GRefCOCO with the latter reward so that it can reject queries with no matching target. On RefCOCOg validation, GenSeg-R1-8B reaches 0.7127 cIoU and 0.7382 mIoU, surpassing Seg-Zero-7B by +3.3 cIoU; on GRefCOCO validation, GenSeg-R1-G reaches 76.69% target mIoU with 82.40% no-target accuracy; on ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU. The models also produce emergent reasoning traces in their <think> outputs without any supervised reasoning-chain annotations. These results indicate robust grounding under both positive and negative queries and show the practical value of placing the downstream segmenter inside the learning loop. The authors suggest a two-stage training recipe and point to applications in interactive vision systems and robotics, where reliable no-target rejection and precise masks are essential.

Abstract

We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.
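The abstract reports both cIoU and mIoU. As a minimal sketch of how the two differ, the snippet below uses the definitions common in the referring-segmentation literature: mIoU averages per-sample IoU, while cIoU divides the summed intersections by the summed unions over the whole split, so large objects weigh more. The function names, the nested-list mask representation, and the convention that two empty masks score 1.0 are illustrative assumptions, not details taken from the paper's evaluation code.

```python
from typing import Iterable, Sequence, Tuple

# Assumption: a mask is an H x W grid of 0/1 (or bool) values.
Mask = Sequence[Sequence[int]]


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += int(bool(p) and bool(g))
            union += int(bool(p) or bool(g))
    return inter, union


def miou_and_ciou(pairs: Iterable[Tuple[Mask, Mask]]) -> Tuple[float, float]:
    """Return (mIoU, cIoU) over (prediction, ground truth) mask pairs."""
    per_sample = []
    total_inter = total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_union(pred, gt)
        # Assumption: an empty prediction against an empty target is a perfect match.
        per_sample.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    if not per_sample:
        raise ValueError("need at least one (prediction, ground truth) pair")
    miou = sum(per_sample) / len(per_sample)
    ciou = total_inter / total_union if total_union else 1.0
    return miou, ciou


# Toy split: one half-covered small object and one perfectly segmented large object.
small = ([[1, 0], [0, 0]], [[1, 1], [0, 0]])  # IoU = 1/2
large = ([[1, 1], [1, 1]], [[1, 1], [1, 1]])  # IoU = 4/4
miou, ciou = miou_and_ciou([small, large])
```

In this toy split the per-sample average is pulled down by the small object, while the cumulative ratio is dominated by the large one, which is why the two metrics can rank models differently.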

Paper Structure (45 sections, 1 equation, 4 figures, 8 tables)

Introduction
Contributions.
Related Work
Promptable segmentation.
Grounding for segmentation.
Referring expression benchmarks.
Instruction data for grounding.
Method
Pipeline Overview
Base Models and Coordinate System
Output Format: Box + Two Keypoints
SAM 2 Integration
Reinforcement Learning with GRPO
Reward Design
GenSeg-R1-4B/8B reward (distance-based).
...and 30 more sections

Figures (4)

Figure 1: GenSeg-R1 architecture. Stage 1: Qwen3-VL (trainable) processes the image and query, reasons in <think> tags, and outputs structured spatial prompts (bounding box + two keypoints) or a no_target flag. Stage 2: SAM 2 (frozen) converts prompts into masks. During GRPO training (dashed arrows), the SAM 2 mask IoU feeds back as a reward signal to update the VLM policy.
Figure 2: Qualitative segmentation comparison. Each row shows a different query. GenSeg-R1-G and GenSeg-R1-4B produce accurate masks (green overlay) closely matching ground truth, while Seg-R1-7B and Seg-Zero-7B produce poor or misaligned masks.
Figure 3: No-target detection comparison. For queries with no matching object, GenSeg-R1-G and GenSeg-R1-4B correctly predict no_target, while Seg-R1-7B and Seg-Zero-7B hallucinate masks for non-existent objects.
Figure 4: Qualitative ReasonSeg comparison. Each row shows a reasoning-heavy query from the ReasonSeg test set. The first three rows illustrate successful cases where GenSeg-R1-G and GenSeg-R1-4B correctly reason about implicit queries, while baselines struggle. The last two rows show failure cases where all models fail to identify the target object.
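The feedback loop described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative approximation only: the reward function `mask_reward` is a hypothetical simplification (mask IoU for queries with a target, a binary score for no-target queries), and the paper's actual reward terms are not reproduced here. The group-relative normalization follows the general GRPO idea of scoring each sampled completion against the other completions for the same prompt; the use of the population standard deviation and the `eps` value are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Optional


def mask_reward(pred_iou: Optional[float], has_target: bool) -> float:
    """Hypothetical reward for one sampled completion.

    pred_iou is the IoU between the SAM 2 mask and the ground truth,
    or None when the policy emitted the no_target flag.
    """
    if not has_target:
        # Reward correct rejection, penalize a hallucinated mask.
        return 1.0 if pred_iou is None else 0.0
    # A target exists: rejecting it earns nothing, otherwise use mask IoU.
    return 0.0 if pred_iou is None else pred_iou


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one query that has a target;
# the last one wrongly predicts no_target.
ious = [0.2, 0.8, 0.5, None]
rewards = [mask_reward(iou, has_target=True) for iou in ious]
advantages = group_relative_advantages(rewards)
```

Completions whose masks beat the group average receive positive advantages and are reinforced, while below-average ones are discouraged; no learned value function is involved in this normalization step.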
