LENS: Learning to Segment Anything with Unified Reinforced Reasoning

Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, Xinggang Wang

TL;DR

LENS tackles text-prompted segmentation by embedding test-time chain-of-thought reasoning into a unified reinforcement-learning framework. It couples a multimodal LLM with a segmentation head through a context module and a pretraining alignment stage, guided by a unified GRPO objective that optimizes sentence-level reasoning, box localization, and pixel-level accuracy. The approach achieves state-of-the-art results on the RefCOCO series and on the ReasonSeg and GroundingSuite benchmarks, demonstrating strong generalization to unseen prompts and domains. By enabling end-to-end reasoning-into-segmentation, LENS offers a scalable path toward more generalizable Segment Anything-style models.

Abstract

Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning significantly enhances text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models (SAM). Code is available at https://github.com/hustvl/LENS.
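The cIoU metric quoted above can be illustrated with a minimal sketch. It assumes cIoU follows the common referring-segmentation convention of cumulative IoU (total intersection over total union across all samples); the function names and the nested-list mask representation are illustrative choices, not taken from the LENS code base.

```python
from typing import Sequence, Tuple

# Assumption: a binary mask is given as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count intersecting and union pixels for one prediction/target pair."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def cumulative_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Cumulative IoU: summed intersections divided by summed unions."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Convention assumed here: an empty union scores 0.0.
    return total_inter / total_union if total_union else 0.0


def mean_iou(pairs: Sequence[Tuple[Mask, Mask]]) -> float:
    """Per-sample IoU averaged over samples, shown only for contrast."""
    scores = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Two toy samples: a half-correct small mask and a perfect larger mask.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),
    ([[1, 1], [1, 1]], [[1, 1], [1, 1]]),
]
print(cumulative_iou(pairs), mean_iou(pairs))
```

Because intersections and unions are pooled before dividing, larger objects weigh more heavily in the cumulative score than in the per-sample average.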

Paper Structure

This paper contains 43 sections, 9 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Framework comparison between the proposed LENS and other methods.
  • Figure 2: An overview of the LENS framework. In the pretraining alignment stage, only the context query and connector are trained, using the segmentation objective. In the reinforcement learning stage, all components except the segmentation image encoder are trained with the multi-grained objectives, i.e., the unified GRPO rewards and the segmentation loss.
  • Figure 3: Qualitative results on the Referring Expression Segmentation task. The proposed LENS can accurately segment partially obscured objects. Thanks to the proposed unified framework, even if the context box contains an error, the segmentation module can correct it using the rich context in the multiple queries.
  • Figure 4: Visualization of reasoning segmentation. The MLLM first generates a CoT reasoning process and a probable box as priors. The context queries then extract information from these priors and prompt the segmentation module to produce an accurate mask.
  • Figure A1: Training curves for Unified GRPO. The format reward, segment reward, box reward, and loss curves show steady reward gains and consistent loss decreases.
  • ...and 7 more figures
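
The format, box, and segment rewards named in the captions above might be combined along the following lines. This is only a sketch under stated assumptions: the `<think>`/`<answer>` template, the equal weights, the use of IoU for the box and segment terms, and the population standard deviation are all hypothetical choices, not the paper's definitions. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) reflects the general GRPO recipe.

```python
import re
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def format_reward(text: str) -> float:
    """1.0 if the response matches a hypothetical think/answer template."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, text, flags=re.DOTALL) else 0.0


def unified_reward(
    text: str,
    pred_box: Box,
    gt_box: Box,
    mask_iou: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # assumed weights
) -> float:
    """Weighted sum of sentence-, box-, and segment-level reward terms."""
    w_fmt, w_box, w_seg = weights
    return (
        w_fmt * format_reward(text)
        + w_box * box_iou(pred_box, gt_box)
        + w_seg * mask_iou
    )


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std within the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of three sampled responses for the same prompt (toy values).
gt = (0.0, 0.0, 2.0, 2.0)
group = [
    ("<think>left cup</think><answer>box</answer>", (0.0, 0.0, 2.0, 2.0), 0.9),
    ("<think>right cup</think><answer>box</answer>", (1.0, 1.0, 3.0, 3.0), 0.4),
    ("no tags at all", (5.0, 5.0, 6.0, 6.0), 0.0),
]
rewards = [unified_reward(t, b, gt, m) for t, b, m in group]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Responses that score above their group's mean receive positive advantages and are reinforced; those below the mean are penalized, so no separate value network is needed.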