Revisiting Salient Object Detection from an Observer-Centric Perspective
Fuxi Zhang, Yifan Wang, Hengrun Zhao, Zhuohan Sun, Changxing Xia, Lijun Wang, Huchuan Lu, Yangrui Shao, Chen Yang, Long Teng
TL;DR
This work reframes salient object detection as an observer-centric task, proposing OC-SOD to model saliency conditioned on observer state via $P(I|T)$, where $T$ encodes preferences or intents in addition to visual cues. It introduces OC-SODBench, a large-scale dataset of 33k images and 152k instruction-mask pairs spanning free-viewing, preference-driven, and intent-driven modes, generated through an efficient MLLM-driven pipeline. The authors also present OC-SODAgent, an agentic baseline that reasons about saliency with an MLLM and refines its predictions through a Perceive-Reflect-Adjust loop, achieving strong zero-shot performance and further gains with fine-tuning. By combining observer-centric prompts with modular segmentation tools, the work lays the groundwork for personalized and context-aware visual saliency modeling, and it releases data and code to accelerate future research.
Abstract
Salient object detection is inherently subjective: observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single ground-truth segmentation map per image, which leaves the problem under-determined and fundamentally ill-posed. To address this issue, we propose Observer-Centric Salient Object Detection (OC-SOD), in which salient regions are predicted from not only visual cues but also observer-specific factors such as preferences or intents. This formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction. Leveraging multi-modal large language models, we develop an efficient data annotation pipeline and construct the first OC-SOD dataset, OC-SODBench, comprising 33k training, validation, and test images with 152k textual prompt-object pairs. Building on this dataset, we further design OC-SODAgent, an agentic baseline that performs OC-SOD via a human-like "Perceive-Reflect-Adjust" process. Extensive experiments on OC-SODBench demonstrate the effectiveness of our contributions. Through this observer-centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly "salient." Code and dataset are publicly available at: https://github.com/Dustzx/OC_SOD
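To make the "Perceive-Reflect-Adjust" idea concrete, the following is a minimal sketch of one way such a loop could be organized. It is not the OC-SODAgent implementation: the function `perceive_reflect_adjust`, the `Step` record, the three callables (`perceive`, `segment`, `reflect`), and the `max_rounds` parameter are all hypothetical names introduced here for illustration. The sketch assumes that an MLLM proposes a target description, a segmentation tool turns it into a mask, and a critique step either accepts the mask or returns feedback that conditions the next proposal.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Step:
    """One round of the loop (hypothetical record, for inspection only)."""
    target: str      # object description proposed in the perceive step
    accepted: bool   # verdict of the reflect step
    feedback: str    # critique handed to the next perceive step


def perceive_reflect_adjust(
    image: Any,
    instruction: str,
    perceive: Callable[[Any, str, str], str],
    segment: Callable[[Any, str], Any],
    reflect: Callable[[Any, str, Any], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[Any, List[Step]]:
    """Iteratively propose, check, and revise a salient-object mask.

    All three callables are assumptions of this sketch:
      perceive(image, instruction, feedback) -> target description
      segment(image, target)                 -> mask
      reflect(image, instruction, mask)      -> (accepted, feedback)
    """
    feedback = ""            # no critique before the first round
    history: List[Step] = []
    mask = None
    for _ in range(max_rounds):
        # Perceive: decide what is salient for this observer.
        target = perceive(image, instruction, feedback)
        mask = segment(image, target)
        # Reflect: check the mask against the observer's instruction.
        accepted, feedback = reflect(image, instruction, mask)
        history.append(Step(target, accepted, feedback))
        if accepted:
            break
        # Adjust: the feedback conditions the next perceive call.
    return mask, history
```

In this sketch the "adjust" stage is simply the feedback string carried into the next round; a real system could instead edit the mask directly or switch segmentation tools. If no round is accepted, the mask from the final round is returned.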
