Revisiting Salient Object Detection from an Observer-Centric Perspective

Fuxi Zhang, Yifan Wang, Hengrun Zhao, Zhuohan Sun, Changxing Xia, Lijun Wang, Huchuan Lu, Yangrui Shao, Chen Yang, Long Teng

TL;DR

This work reframes salient object detection as an observer-centric task, proposing OC-SOD to model saliency conditioned on observer state via $P(I|T)$, where $T$ encodes preferences or intents in addition to visual cues. It introduces OC-SODBench, a large-scale dataset with 33k images and 152k instruction-mask pairs generated through an efficient MLLM-driven pipeline across free-viewing, preference-driven, and intent-driven modes. The authors present OC-SODAgent, an agentic baseline that reasons about saliency with an MLLM and refines predictions through a Perceive–Reflect–Adjust loop, achieving strong zero-shot performance and further gains with fine-tuning. By marrying observer-centric prompts with modular segmentation tools, this work lays the groundwork for personalized and context-aware visual saliency modeling and provides public data and code to accelerate future research.

Abstract

Salient object detection is inherently a subjective problem, as observers with different priors may perceive different objects as salient. However, existing methods predominantly formulate it as an objective prediction task with a single ground-truth segmentation map for each image, which renders the problem under-determined and fundamentally ill-posed. To address this issue, we propose Observer-Centric Salient Object Detection (OC-SOD), where salient regions are predicted by considering not only visual cues but also observer-specific factors such as preferences or intents. This formulation captures the intrinsic ambiguity and diversity of human perception, enabling personalized and context-aware saliency prediction. By leveraging multi-modal large language models, we develop an efficient data annotation pipeline and construct the first OC-SOD dataset, named OC-SODBench, comprising 33k training, validation, and test images with 152k pairs of textual prompts and objects. Built upon this new dataset, we further design OC-SODAgent, an agentic baseline that performs OC-SOD via a human-like "Perceive-Reflect-Adjust" process. Extensive experiments on OC-SODBench demonstrate the effectiveness of our contributions. Through this observer-centric perspective, we aim to bridge the gap between human perception and computational modeling, offering a more realistic and flexible understanding of what makes an object truly "salient." Code and dataset are publicly available at: https://github.com/Dustzx/OC_SOD

Paper Structure (25 sections, 3 equations, 12 figures, 5 tables, 1 algorithm)


Figures (12)

  • Figure 1: Illustration of the limitations of traditional Salient Object Detection (SOD) and our proposed Observer-Centric (OC-SOD) solution. (a) Traditional SOD aligns with human consensus in simple scenes. (b) In complex scenes, it becomes ill-posed due to inter-annotator disagreements driven by diverse subjective priors. (c) Our OC-SOD paradigm resolves this ambiguity by modeling distinct subjective contexts. These include the feature-driven "Free-viewing mode" as well as modes defined by specific subjective priors, such as Preference-Driven (e.g., "Foodie") and Intent-Driven (e.g., "I want to check my email"). Integrating these explicit priors renders the segmentation task well-posed and unambiguous.
  • Figure 2: An overview of our 5-step data annotation pipeline. The process begins with pre-annotated images and employs Multimodal Large Language Models (MLLMs) for key tasks. The steps are: (1) Data Filtering to remove unsuitable samples (e.g., "hard to focus," "not useful"); (2) MLLM-driven Data Categorization to analyze saliency; (3) Instruction Generation using an MLLM to create intent- or preference-based prompts; (4) automated Data Verification by an MLLM to check for errors; and (5) final Manual Curation by experts to ensure dataset quality, checking criteria such as safety, focus necessity, and relevance.
  • Figure 3: Statistical overview of the OC-SODBench dataset. (a) Word clouds of target objects, preferences, and intents. (b) Histogram of object counts. (c) The pixel-area ratio of target masks. Please zoom in for details.
  • Figure 4: Pipeline of the proposed OC-SODAgent. In the initial prediction stage (the green arrows), given the input image with the instruction, the MLLM first parses the user's intent/preference and generates an initial bounding box $B_0$, which is then processed by SAMv2 to produce an initial mask $M_0$. Based on the predicted bounding box and mask, a rendered image is synthesized, overlaying the box and the contour of the region of interest. The process then enters the Perceive–Reflect–Adjust cycle (the red arrows), where the MLLM together with SAMv2 iteratively perceive, reason, and refine the output, repeating until convergence to produce the final result $M_{final}$.
  • Figure 5: Example of visualization results under the intent-driven mode (the top two rows) and preference-driven mode (the last two rows).
  • ...and 7 more figures
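
The control flow described in the Figure 4 caption (initial box $B_0$, initial mask $M_0$, then a Perceive-Reflect-Adjust cycle repeated until convergence) can be sketched as below. This is a minimal illustration, not the authors' implementation: the callables `propose_box`, `segment`, and `reflect`, the `Feedback` type, and the `max_rounds` cap are all hypothetical names introduced here. In a real system `propose_box` and `reflect` would wrap an MLLM (with `reflect` also rendering the box and mask contour onto the image), and `segment` would wrap a box-prompted segmenter such as SAMv2; the toy stand-ins exist only so the loop runs.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), end-exclusive
Mask = List[List[int]]            # binary mask as nested lists


@dataclass
class Feedback:
    """Hypothetical result of one reflection step."""
    accept: bool
    adjusted_box: Optional[Box] = None


def perceive_reflect_adjust(
    image,
    instruction: str,
    propose_box: Callable[[object, str], Box],
    segment: Callable[[object, Box], Mask],
    reflect: Callable[[object, str, Box, Mask], Feedback],
    max_rounds: int = 3,  # assumed safety cap, not taken from the paper
):
    box = propose_box(image, instruction)  # initial box B_0
    mask = segment(image, box)             # initial mask M_0
    adjustments = 0
    for _ in range(max_rounds):
        # Perceive + Reflect: judge the current box/mask against the instruction.
        feedback = reflect(image, instruction, box, mask)
        converged = (
            feedback.accept
            or feedback.adjusted_box is None
            or feedback.adjusted_box == box
        )
        if converged:
            break
        # Adjust: take the revised box and re-segment.
        box = feedback.adjusted_box
        mask = segment(image, box)
        adjustments += 1
    return box, mask, adjustments


# --- Toy stand-ins (assumptions for illustration only) ---
def toy_propose(image, instruction):
    return (0, 0, 2, 2)


def toy_segment(image, box):
    x0, y0, x1, y1 = box
    height, width = len(image), len(image[0])
    return [
        [1 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]


def toy_reflect(image, instruction, box, mask):
    target = (1, 1, 3, 3)
    if box == target:
        return Feedback(accept=True)
    return Feedback(accept=False, adjusted_box=target)


image = [[0] * 4 for _ in range(4)]
final_box, final_mask, adjustments = perceive_reflect_adjust(
    image, "I want to check my email", toy_propose, toy_segment, toy_reflect
)
```

Passing the model and segmenter in as callables keeps the loop independent of any particular MLLM or segmentation backend, which mirrors the modular design the caption describes.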