Zero-Shot Co-salient Object Detection Framework

Haoke Xiao, Lv Tang, Bo Li, Zhiming Luo, Shaozi Li

TL;DR

The paper tackles co-salient object detection (CoSOD) without any training by using foundational computer vision models to build cross-image group representations. It introduces two modules, Group Prompt Generation (GPG) and Co-saliency Map Generation (CMP): fused DINO (high-level) and Stable Diffusion (low-level) features yield per-image group prompts, which drive a SAM-based co-saliency mapper. All components stay frozen ($t=50$, $K=2$ prompts per image). On Cosal2015, CoSOD3k, and CoCA, the approach surpasses existing unsupervised methods, outperforms fully supervised methods developed before 2020, and remains competitive with some fully supervised methods developed before 2022. The results suggest that foundational models can be applied to CoSOD without fine-tuning, which may guide future unsupervised CoSOD research.

Abstract

Co-salient Object Detection (CoSOD) endeavors to replicate the human visual system's capacity to recognize common and salient objects within a collection of images. Despite recent advancements in deep learning models, these models still rely on training with well-annotated CoSOD datasets. The exploration of training-free zero-shot CoSOD frameworks has been limited. In this paper, taking inspiration from the zero-shot transfer capabilities of foundational computer vision models, we introduce the first zero-shot CoSOD framework that harnesses these models without any training process. To achieve this, we introduce two novel components in our proposed framework: the group prompt generation (GPG) module and the co-saliency map generation (CMP) module. We evaluate the framework's performance on widely-used datasets and observe impressive results. Our approach surpasses existing unsupervised methods and even outperforms fully supervised methods developed before 2020, while remaining competitive with some fully supervised methods developed before 2022.
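The idea of turning a group representation into per-image prompts can be illustrated with a minimal sketch. This summary does not give the exact formulation, so everything below is an assumption rather than the authors' method: fusion is taken to be concatenation of L2-normalized features, the group representation is taken to be a mean over all locations in the group, and the prompts are taken to be the top-K locations by cosine similarity. The names `fuse_features` and `group_point_prompts` are hypothetical, and plain arrays stand in for the DINO and Stable Diffusion feature maps; SAM itself is not called.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit length (zero vectors stay zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse_features(high, low):
    """Assumed fusion: concatenate normalized features at each location.

    high: (N, H, W, C1) stand-in for high-level features
    low:  (N, H, W, C2) stand-in for low-level features
    returns: (N, H, W, C1 + C2)
    """
    return np.concatenate([l2_normalize(high), l2_normalize(low)], axis=-1)


def group_point_prompts(fused, k=2):
    """Pick k point prompts per image from a shared group representation.

    Assumption: the group vector is the mean of all location features
    across the image group; prompts are the k most similar locations.
    returns: (N, k, 2) integer array of (row, col) coordinates
    """
    n, h, w, c = fused.shape
    flat = l2_normalize(fused.reshape(n, h * w, c))
    group = l2_normalize(flat.reshape(-1, c).mean(axis=0))
    sim = flat @ group                      # (N, H*W) cosine similarities
    top = np.argsort(-sim, axis=1)[:, :k]   # indices of the k best locations
    rows, cols = np.divmod(top, w)          # flat index -> (row, col)
    return np.stack([rows, cols], axis=-1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    high = rng.normal(size=(3, 8, 8, 16))   # three images in one group
    low = rng.normal(size=(3, 8, 8, 4))
    prompts = group_point_prompts(fuse_features(high, low), k=2)
    print(prompts.shape)
```

In a full pipeline, the returned coordinates would be rescaled to image resolution and passed as point prompts to a frozen segmentation model; with $K=2$, each image receives two such points.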

Paper Structure (10 sections, 4 equations, 4 figures, 2 tables)

This paper contains 10 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Left: The architecture of our proposed zero-shot CoSOD framework. Right: The performance of our proposed zero-shot CoSOD framework. GWD [DBLP:conf/ijcai/WeiZBLW17], RCAN [DBLP:conf/ijcai/0061STSS19], ICNet [DBLP:conf/nips/Jin0CZG20], CADC [DBLP:conf/iccv/ZhangHL021], and UFO [su2023unified] are five typical methods.
  • Figure 2: The architecture of our proposed zero-shot CoSOD framework. DINO and Stable Diffusion (SD) extract both high-level and low-level features. The CMP module uses SAM to generate the co-saliency maps. All network parameters remain frozen, so no additional training is needed.
  • Figure 3: The generated group features.
  • Figure 4: Visual comparison between our method and other methods.