Saliency-R1: Incentivizing Unified Saliency Reasoning Capability in MLLM with Confidence-Guided Reinforcement Learning

Long Li, Shuichen Ji, Ziyang Luo, Zhihui Li, Dingwen Zhang, Junwei Han, Nian Liu

TL;DR

Saliency-R1 addresses the gap in saliency reasoning for multimodal LLMs by unifying Salient Object Detection, Salient Instance Segmentation, and Co-salient Object Detection under a single framework built on a structured textual interface with <rg> and <ins> tags. Training proceeds in two stages: Supervised Fine-Tuning, followed by Confidence-Guided Policy Optimization (CGPO). CGPO uses a per-sample advantage A = r - c, the discrepancy between the reward r and the model's confidence c, which enables confidence-aware reinforcement learning from a single sampled response and reduces overhead relative to GRPO. Referring expressions parsed from the chain-of-thought (CoT) output are passed to a dedicated referring segmenter (EVF-SAM), which produces task-specific masks, while task-adaptive rewards and strict formatting keep the reasoning outputs coherent and parseable. Across nine standard benchmarks, Saliency-R1 matches or surpasses robust MLLMs and task-specific methods, supporting the case for unified saliency reasoning in integrated vision-language systems.
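To make the textual interface concrete, here is a minimal sketch of how region-level and instance-level referring expressions might be pulled out of a CoT response before being handed to the segmenter. The paired, XML-style tag syntax (`<rg>...</rg>`, `<ins>...</ins>`), the function name `parse_referring_expressions`, and the example sentence are all assumptions for illustration; the summary names the tags but does not specify their exact format.

```python
import re

# Assumption: tags are paired XML-style (<rg>...</rg>, <ins>...</ins>).
# The source names the tags but not their exact syntax.
TAG_PATTERN = re.compile(r"<(rg|ins)>(.*?)</\1>", re.DOTALL)


def parse_referring_expressions(cot_text):
    """Collect region-level (rg) and instance-level (ins) expressions.

    Hypothetical helper: returns {'rg': [...], 'ins': [...]} with each
    expression stripped of surrounding whitespace.
    """
    found = {"rg": [], "ins": []}
    for tag, body in TAG_PATTERN.findall(cot_text):
        expression = body.strip()
        if expression:  # skip empty tag pairs
            found[tag].append(expression)
    return found


# Made-up CoT fragment, for illustration only.
example = (
    "The scene shows <rg>two dogs on the grass</rg>: "
    "<ins>the brown dog</ins> and <ins>the white dog</ins>."
)
parsed = parse_referring_expressions(example)
```

Under these assumptions, a region-level task such as SOD would presumably use the `rg` expressions, while an instance-level task such as SIS would send each `ins` expression to the segmenter separately.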

Abstract

Although multimodal large language models (MLLMs) excel in high-level vision-language reasoning, they lack inherent awareness of visual saliency, making it difficult to identify key visual elements. To bridge this gap, we propose Saliency-R1, the first unified MLLM framework that jointly tackles three representative and heterogeneous saliency tasks: Salient Object Detection (SOD), Salient Instance Segmentation (SIS), and Co-salient Object Detection (CoSOD), enhancing the model's capacity for saliency reasoning. We introduce a textual interface with structured tags (<rg>, <ins>) to encode region- and instance-level referring expressions, enabling a single referring segmenter to produce task-appropriate masks. To train the MLLM efficiently, we propose Confidence-Guided Policy Optimization (CGPO), a novel single-sample reinforcement learning algorithm. CGPO improves on GRPO by replacing group-normalized advantages with a per-sample signal based on reward-confidence discrepancy, thereby reducing computational waste, mitigating signal dilution, and lowering training overhead. Our model exceeds or matches the performance of robust open/closed-source MLLMs and specialized state-of-the-art methods across all three tasks, demonstrating the efficacy of our framework in saliency reasoning.
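The difference between the two advantage signals can be sketched in a few lines. The GRPO function follows the standard group normalization (reward minus group mean, divided by group standard deviation); the CGPO function follows the A = r - c form given in the TL;DR. The function names, the use of the population standard deviation, the `eps` stabilizer, and the example values are assumptions, not details taken from the paper.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages over several responses to one prompt.

    Assumption: population standard deviation plus a small eps for
    numerical stability; implementations differ on these details.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def cgpo_advantage(reward, confidence):
    """Per-sample advantage A = r - c from a single response.

    Assumption: reward and confidence both lie in [0, 1].
    """
    return reward - confidence


# When every response in a group earns the same reward, the
# group-normalized signal vanishes for all of them.
uniform_group = grpo_advantages([1.0, 1.0, 1.0, 1.0])

# A single correct but under-confident response still gets a positive signal.
underconfident = cgpo_advantage(reward=1.0, confidence=0.6)

# A confident but wrong response gets a strongly negative signal.
overconfident = cgpo_advantage(reward=0.0, confidence=0.9)
```

The uniform-reward group illustrates one reading of the wasted computation mentioned above: several responses are sampled, yet none of them contributes a gradient signal, whereas the per-sample form needs only one response and remains informative whenever reward and confidence disagree.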

Paper Structure

This paper contains 34 sections, 24 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Existing multimodal large language models (MLLMs) exhibit limitations in saliency reasoning. This paper proposes Saliency-R1 to incentivize unified saliency reasoning of MLLM across three representative tasks, i.e., Salient Object Detection (SOD), Salient Instance Segmentation (SIS), and Co-salient Object Detection (CoSOD).
  • Figure 2: Overview of the Saliency-R1 framework. Given visual inputs and task-specific instructions, the MLLM (Qwen2.5-VL) generates structured CoT reasoning, from which referring expressions are parsed and passed to a referring segmentation model (EVF-SAM) to produce task-specific masks for SOD, SIS, and CoSOD.
  • Figure 3: Comparison between GRPO and our proposed CGPO. GRPO uses multiple responses, group-normalized advantages, and KL regularization with a reference model. In contrast, CGPO employs a single response, calculates advantage with reward-confidence discrepancy, and replaces KL with ISR (Interleaved SFT Regularization).
  • Figure 3: Comparison with SOTA closed-source and open-source MLLMs on the ECSSD, ILSO, and CoSOD3k datasets. Bold and underline mark the best and second-best results, respectively.
  • Figure 4: Joint Response-Type Distribution and Per-Sample Advantage Analysis. We analyze 8,000 responses generated by the SFT-initialized model, categorizing them into four types based on the bottom/top 20% empirical thresholds of reward and model confidence. Confidence is defined as the mean token-wise generation probability of the CoT output (Eq. \ref{confidence_calcualtion}). Advantages for GRPO and CGPO are each rank-normalized (chen2022rank) to $[-1,\,1]$ to ensure a fair comparison.
  • ...and 11 more figures
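
The confidence measure and the four-way categorization described in the Figure 4 caption can be sketched as follows. This is one plausible reading, not the paper's procedure: it assumes per-token log-probabilities are available for the CoT output, treats a value as "high" at or above the 80th percentile and "low" at or below the 20th percentile, and leaves everything in between uncategorized. The helper names, the nearest-rank quantile, and the type labels are hypothetical.

```python
import math


def mean_token_confidence(token_logprobs):
    """Mean per-token generation probability of the CoT tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def quantile(values, q):
    """Nearest-rank quantile; a simple stand-in for an empirical threshold."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def response_type(reward, confidence, rewards, confidences):
    """Assign one of four hypothetical types, or None if not in a tail."""
    r_low, r_high = quantile(rewards, 0.2), quantile(rewards, 0.8)
    c_low, c_high = quantile(confidences, 0.2), quantile(confidences, 0.8)
    r_level = "high" if reward >= r_high else "low" if reward <= r_low else None
    c_level = (
        "high" if confidence >= c_high else "low" if confidence <= c_low else None
    )
    if r_level is None or c_level is None:
        return None
    return f"{r_level}-reward/{c_level}-confidence"


# Toy population of ten responses; values are made up for illustration.
rewards = [i / 10 for i in range(1, 11)]
confidences = list(reversed(rewards))

conf = mean_token_confidence([math.log(0.5), math.log(1.0)])
label = response_type(1.0, 0.1, rewards, confidences)
middle = response_type(0.5, 0.5, rewards, confidences)
```

The high-reward/low-confidence and low-reward/high-confidence cells are where a reward-minus-confidence signal is largest in magnitude, which is presumably why the analysis splits responses along both axes.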