SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning

Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, Kehong Yuan

TL;DR

SAM-R1 tackles fine-grained reasoning segmentation in multimodal settings by embedding SAM within a reinforcement-learning loop as a reward provider. It introduces task-specific, fine-grained rewards and an enhanced GRPO-based optimization (with asymmetric clipping and token-level loss normalization) to align reasoning with pixel-precise segmentation, achieving strong zero-shot results using only 3k training samples. Empirically, SAM-R1 outperforms prior methods on ReasonSeg and generalizes to referring expression comprehension (REC) tasks, indicating effective cross-domain transfer without REC supervision. The work highlights that reward-guided learning can instill perceptual reasoning in multimodal models while reducing data requirements, with potential for broader applications beyond segmentation.
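
To make the optimization terms above concrete, the following is a minimal NumPy sketch of a GRPO-style loss with asymmetric clipping and token-level normalization. It is one common formulation of these ideas, not the paper's exact objective: the function names, the clipping bounds `eps_low` and `eps_high`, and the array layout are all assumptions, and any KL penalty is omitted for brevity.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one sampled group to zero mean, unit scale."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def clipped_token_level_loss(logp_new, logp_old, advantages, mask,
                             eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss averaged over all valid tokens in the group.

    logp_new, logp_old: (G, T) per-token log-probabilities under the current
        and the sampling policy (G responses, T padded tokens each).
    advantages: (G,) one advantage per response, shared by its tokens.
    mask: (G, T) with 1 for real tokens and 0 for padding.
    eps_low, eps_high: hypothetical lower/upper clip widths; the bounds are
        asymmetric whenever eps_low != eps_high.
    """
    logp_new = np.asarray(logp_new, dtype=np.float64)
    logp_old = np.asarray(logp_old, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    adv = np.asarray(advantages, dtype=np.float64)[:, None]

    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    per_token = np.minimum(unclipped, clipped) * mask

    # Token-level normalization: divide by the total number of valid tokens
    # in the group rather than averaging each response separately first.
    return -per_token.sum() / max(mask.sum(), 1.0)
```

With token-level normalization, every token carries the same weight, so longer responses contribute proportionally more to the gradient than they would under a per-response average.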

Abstract

Leveraging multimodal large models for image segmentation has become a prominent research direction. However, existing approaches typically rely heavily on manually annotated datasets that include explicit reasoning processes, which are costly and time-consuming to produce. Recent advances suggest that reinforcement learning (RL) can endow large models with reasoning capabilities without requiring such reasoning-annotated data. In this paper, we propose SAM-R1, a novel framework that enables multimodal large models to perform fine-grained reasoning in image understanding tasks. Our approach is the first to incorporate fine-grained segmentation settings during the training of multimodal reasoning models. By integrating task-specific, fine-grained rewards with a tailored optimization objective, we further enhance the model's reasoning and segmentation alignment. We also leverage the Segment Anything Model (SAM) as a strong and flexible reward provider to guide the learning process. With only 3k training samples, SAM-R1 achieves strong performance across multiple benchmarks, demonstrating the effectiveness of reinforcement learning in equipping multimodal models with segmentation-oriented reasoning capabilities.

Paper Structure

This paper contains 20 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: SAM-R1 generates a reasoning chain prior to producing the segmentation mask. It employs a reinforcement learning (RL) strategy, learning the reasoning process from scratch. In comparison to supervised fine-tuning (SFT), the RL-enhanced model, which incorporates fine-grained rewards based on SAM, demonstrates superior performance on both in-domain and out-of-domain data.
  • Figure 2: Our framework integrates the Segment Anything Model (SAM) as a reward provider in the reinforcement learning training of a multimodal large language model (MLLM). The two models jointly process user-input questions and images to identify target objects and generate masks. Specifically, the MLLM generates the reasoning process and answer, then passes them to SAM. A fine-grained reward based on Intersection over Union (IoU) is calculated to optimize the MLLM.
  • Figure 3: Qualitative results on ReasonSeg (Lai et al., 2024) demonstrate that SAM-R1 exhibits robust zero-shot performance, further enhanced by the chain-of-thought approach with improved reasoning capacity.
  • Figure 4: Ablation study failures: (a) Removing the KL constraint leads to training instability and collapse. (b) Encouraging both positive and negative point generation causes negatives to appear outside target areas. (c) Forcing all points into the bounding box eliminates useful contrast, reducing performance.
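
The IoU measure that the Figure 2 caption names as the basis of the reward can be sketched as follows. This is a minimal illustration that assumes binary masks of equal shape. The function name `mask_iou` is hypothetical, and the mapping from IoU to the final fine-grained reward is not specified here, so the sketch stops at the raw IoU value.

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """Intersection over Union between two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; returning 0.0 is a convention chosen for
        # this sketch, not something stated in the source.
        return 0.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)
```

In a training loop of the kind the caption describes, `pred_mask` would be the mask SAM produces from the MLLM's answer and `gt_mask` the annotated target, so a higher overlap yields a higher reward for that sampled response.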