TF-SSD: A Strong Pipeline via Synergic Mask Filter for Training-free Co-salient Object Detection

Zhijin He, Shuo Jin, Siyue Yu, Shuwei Wu, Bingfeng Zhang, Li Yu, Jimin Xiao

Abstract

Co-salient Object Detection (CoSOD) aims to segment salient objects that consistently appear across a group of related images. Despite the notable progress achieved by recent training-based approaches, they remain constrained by closed-set datasets and exhibit limited generalization. Meanwhile, few studies have explored the potential of Vision Foundation Models (VFMs), which demonstrate strong generalization ability and robust saliency understanding, for CoSOD. In this paper, we investigate and leverage VFMs for CoSOD and propose a novel training-free method, TF-SSD, built on the synergy between SAM and DINO. Specifically, we first utilize SAM to generate comprehensive raw proposals, which serve as a candidate mask pool. We then introduce a quality mask generator to filter out redundant masks, yielding a refined mask set. Since this generator is built upon SAM, it inherently lacks semantic understanding of saliency. To this end, we adopt an intra-image saliency filter that employs DINO's attention maps to identify visually salient masks within individual images. Moreover, to extend saliency understanding across group images, we propose an inter-image prototype selector, which computes similarity scores among cross-image prototypes and selects the mask with the highest score. These selected masks serve as the final predictions for CoSOD. Extensive experiments show that our TF-SSD outperforms existing methods (e.g., 13.7% gains over the recent training-free method). Code is available at https://github.com/hzz-yy/TF-SSD.
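The intra-image saliency filtering described above can be sketched as an overlap score between each candidate mask and DINO's self-attention map. The area-normalized overlap score and the `filter_salient_masks` helper below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def saliency_score(mask, attn):
    """Score a candidate mask by its overlap with the DINO self-attention
    map, normalized by mask area so large background masks are not favored.
    `mask` is a binary HxW array; `attn` is an HxW attention map.
    This overlap-based score is an illustrative assumption, not the
    paper's exact formula."""
    area = mask.sum()
    if area == 0:
        return 0.0
    return float((attn * mask).sum() / area)

def filter_salient_masks(masks, attn, keep=1):
    """Keep the `keep` highest-scoring masks from a refined mask set,
    mirroring the role of the intra-image saliency filter."""
    ranked = sorted(masks, key=lambda m: saliency_score(m, attn), reverse=True)
    return ranked[:keep]
```

A mask that covers the high-attention region gets a high score, while masks disjoint from it score near zero, which matches the behavior illustrated in Figure 4.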

Paper Structure

This paper contains 27 sections, 13 equations, 9 figures, 9 tables, 1 algorithm.

Figures (9)

  • Figure 1: Segmentation results derived from SAM. The oracle masks selected under the ground truth guidance exhibit co-salient representations, which are highlighted in the green rectangle.
  • Figure 2: Comparison of our TF-SSD and other CoSOD methods. $F_m$ and $S_m$ denote the F-measure and S-measure, respectively. Our method surpasses training-based methods and achieves 13.7% improvement over the recent training-free SOTA method.
  • Figure 3: Overall pipeline of our TF-SSD. It contains four components: a Quality Mask Generator (QMG) that filters the exhaustive candidate masks from SAM, a DINO encoder for salient attention and semantic prototype extraction, an Intra-image Saliency Filter (ISF) for visually salient purification within a single image, and an Inter-image Prototype Selector (IPS) that builds the saliency relationship across group images to discover the co-salient masks as the final predictions for CoSOD. For clarity, our notations are defined as: $m_{n,t}$ denotes an individual mask, $\textbf{M}_n$ denotes the mask set of image $n$ (e.g., $\textbf{M}_n^{raw}$, $\textbf{M}_n^{refine}$).
  • Figure 4: Illustration of Intra-image Saliency Filter. Row 1: Original image, self-attention map of DINO, and ground truth. Row 2: Three mask proposals generated by QMG. Row 3: Attention response obtained by element-wise multiplication of each mask with the attention map. Mask 1 achieves a high saliency score due to strong overlap between the salient regions and the mask proposal, while Masks 2 and 3 receive low scores because they barely overlap the salient regions.
  • Figure 5: Illustration of Inter-image Prototype Selector (IPS) with an example of N=3 images and T=3 masks per image. A pairwise similarity matrix is constructed using mask prototypes, where different colors denote prototypes from different images. For each candidate (e.g., $p_{1,1}$ of image $I_1$), we select its maximum score (deep red cells) against the candidates from each of the other images ($I_2$ and $I_3$). These maximum scores are summed as the total co-saliency score $s^{co}_{max}$. The mask with the highest $s^{co}_{max}$ is selected as the final prediction.
  • ...and 4 more figures
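The selection rule illustrated in Figure 5 can be sketched as follows. Cosine similarity between mask prototypes and the construction of prototypes themselves (e.g., mask-averaged DINO features) are our assumptions; the sketch only reproduces the max-then-sum scoring described in the caption:

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between two sets of prototype vectors."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def select_co_salient(prototypes):
    """Sketch of the Inter-image Prototype Selector (IPS).
    `prototypes` is a list of N arrays, one per image, each of shape
    (T_n, D): one D-dim prototype per candidate mask. For every candidate
    we take its best match in each other image and sum these maxima into
    a co-saliency score; the highest-scoring mask per image is selected."""
    picks = []
    for i, protos_i in enumerate(prototypes):
        scores = np.zeros(len(protos_i))
        for j, protos_j in enumerate(prototypes):
            if j == i:
                continue
            sim = cosine_sim(protos_i, protos_j)  # (T_i, T_j)
            scores += sim.max(axis=1)             # best match in image j
        picks.append(int(scores.argmax()))
    return picks
```

With N=3 images and T candidates each, a candidate whose prototype recurs across the group accumulates N-1 high maxima and wins the argmax, which is exactly the deep-red-cell selection shown in Figure 5.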