
Determined by User Needs: A Salient Object Detection Rationale Beyond Conventional Visual Stimuli

Chenglizhao Chen, Shujian Zhang, Luming Li, Wenfeng Song, Shuai Li

Abstract

Existing \textbf{s}alient \textbf{o}bject \textbf{d}etection (SOD) methods adopt a \textbf{passive} visual stimulus-based rationale--objects with the strongest visual stimuli are perceived as the user's primary focus (i.e., salient objects). They ignore the decisive role of users' \textbf{proactive needs} in segmenting salient objects--if a user has a need before seeing an image, the user's salient objects align with that need. For example, if a user's need is ``white apple'', then when this user sees an image, the user's primary focus is on the ``white apple'' or ``most white apple-like'' objects in the image. Such an oversight not only \textbf{fails to satisfy users}, but also \textbf{limits the development of downstream tasks}. For instance, in salient object ranking tasks, focusing solely on visual stimulus-based salient objects is insufficient for analyzing the fine-grained relationships between users' viewing order (usually determined by their needs) and scenes, which may lead to incorrect ranking results. Clearly, it is essential to detect salient objects based on user needs. Thus, we advocate a \textbf{User} \textbf{S}alient \textbf{O}bject \textbf{D}etection (UserSOD) task, which focuses on \textbf{detecting salient objects that align with users' proactive needs whenever such needs exist}. The main challenge for this new task is the lack of datasets for model training and testing.


Paper Structure

This paper contains 14 sections, 5 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Motivation demonstration of our new task. Current SOTA methods (A) always segment salient objects based on a passive visual stimulus-based rationale, limiting the performance of downstream tasks (e.g., salient object ranking; see details in Fig. \ref{fig:subtask}). Therefore, we introduce a novel task, UserSOD (B), which detects salient objects based on user needs. It accounts for the crucial role of users' active needs, thereby aligning more closely with real users' primary focus mechanisms.
  • Figure 2: Application of salient object detection methods to salient object ranking (SOR). (A) Existing SOTA SOD methods ignore users' active needs, producing fixed saliency rankings that fail to satisfy user needs and deviate from the ground truth (GT, bottom left). In contrast, our approach (B) returns finer-grained ranking results, better meeting user needs.
  • Figure 3: The pipeline of our method. Our method consists of two components, i.e., the user need digger (UND) and the user need-driven salient object detection model (Uersal$^+$). UND semi-automatically digs user need commands from existing samples, and Uersal$^+$ leverages fine-grained clues from user need commands to detect salient objects.
  • Figure 4: Existing sets vs. our UserSOD set. Compared to existing sets, our UserSOD set contains samples comprising an image, the corresponding user need commands, and the corresponding ground truths (GT), where i$\in$[1, $+\infty$) and j$\in$[1, $+\infty$) denote the numbers of masks and user need commands in a single sample, respectively.
  • Figure 5: The proposed User Need Digger.
  • ...and 3 more figures