Saliency-R1: Incentivizing Unified Saliency Reasoning Capability in MLLM with Confidence-Guided Reinforcement Learning
Long Li, Shuichen Ji, Ziyang Luo, Zhihui Li, Dingwen Zhang, Junwei Han, Nian Liu
TL;DR
Saliency-R1 addresses the lack of saliency reasoning in multimodal LLMs by unifying Salient Object Detection, Salient Instance Segmentation, and Co-salient Object Detection under a single framework with a structured textual interface built on <rg> and <ins> tags. Training proceeds in two stages, Supervised Fine-Tuning followed by Confidence-Guided Policy Optimization (CGPO), which uses a per-sample advantage A = r - c (reward minus confidence). This enables confidence-aware, single-sample reinforcement learning with lower overhead than GRPO. A dedicated referring segmenter (EVF-SAM) takes the referring expressions derived from the chain-of-thought (CoT) output and produces task-specific masks, while task-adaptive rewards and strict formatting constraints keep the reasoning outputs coherent and parseable. Across nine standard benchmarks, Saliency-R1 matches or surpasses strong MLLMs and task-specific methods, supporting the case for unified saliency reasoning in integrated vision-language systems.
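As an illustration of the tagged textual interface, the following minimal Python sketch extracts region-level and instance-level referring expressions from a model response. The summary names only the <rg> and <ins> tags; the closing-tag syntax (</rg>, </ins>), the function name, and the sample response are assumptions made for this example, not the authors' implementation.

```python
import re

# Assumption: each expression is wrapped as <rg>...</rg> or <ins>...</ins>.
# The source names only the opening tags; the closing form is hypothetical.
TAG_PATTERN = re.compile(r"<(rg|ins)>(.*?)</\1>", re.DOTALL)


def parse_referring_expressions(response: str) -> dict:
    """Collect tagged expressions from a response, grouped by tag name.

    "rg" holds region-level expressions and "ins" holds instance-level
    expressions, each in order of appearance.
    """
    expressions = {"rg": [], "ins": []}
    for tag, body in TAG_PATTERN.findall(response):
        expressions[tag].append(body.strip())
    return expressions


# Hypothetical response text, for illustration only.
sample = (
    "The kite stands out against the sky. <rg>the red kite</rg> "
    "Two children are also prominent. "
    "<ins>the child on the left</ins><ins>the child on the right</ins>"
)
parsed = parse_referring_expressions(sample)
```

Each extracted string would then be handed to the referring segmenter to obtain a mask. A response that yields two empty lists could be treated as a formatting failure.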
Abstract
Although multimodal large language models (MLLMs) excel in high-level vision-language reasoning, they lack inherent awareness of visual saliency, making it difficult for them to identify key visual elements. To bridge this gap, we propose Saliency-R1, the first unified MLLM framework that jointly tackles three representative and heterogeneous saliency tasks: Salient Object Detection (SOD), Salient Instance Segmentation (SIS), and Co-salient Object Detection (CoSOD), enhancing the model's capacity for saliency reasoning. We introduce a textual interface with structured tags (<rg>, <ins>) to encode region- and instance-level referring expressions, enabling a single referring segmenter to produce task-appropriate masks. To train the MLLM efficiently, we propose Confidence-Guided Policy Optimization (CGPO), a novel single-sample reinforcement learning algorithm. CGPO improves on GRPO by replacing group-normalized advantages with a per-sample signal based on the discrepancy between reward and confidence, thereby reducing computational waste, mitigating signal dilution, and lowering training overhead. Our model matches or exceeds the performance of strong open- and closed-source MLLMs and specialized state-of-the-art methods across all three tasks, demonstrating the efficacy of our framework in saliency reasoning.
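To make the contrast with GRPO concrete, the sketch below compares a group-normalized advantage with the per-sample signal A = r - c. It is a minimal illustration, not the authors' code: how the confidence c is obtained is not described in this summary, so it is passed in as a plain number assumed to lie on the same scale as the reward, and the function names and the eps constant are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO.

    Requires several sampled responses for the same prompt; each reward
    is centered and scaled by the statistics of its group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def cgpo_advantage(reward, confidence):
    """Per-sample signal A = r - c computed from a single response.

    Assumption: `confidence` is a scalar on the same scale as `reward`
    (for example both in [0, 1]); its estimation is not shown here.
    """
    return reward - confidence


# When every response in a group earns the same reward, the
# group-normalized advantages are all zero and carry no signal.
uniform_group = grpo_advantages([1.0, 1.0, 1.0, 1.0])

# A single response with a high reward but low confidence still
# yields a non-zero per-sample signal.
single_sample = cgpo_advantage(reward=1.0, confidence=0.25)
```

The per-sample form needs only one response per prompt, which is one way to read the reduced overhead the abstract describes. The uniform-reward case is a commonly noted situation in which group normalization produces no gradient signal.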
