Visual Consensus Prompting for Co-Salient Object Detection

Jie Wang, Nana Yu, Zihao Zhang, Yahong Han

TL;DR

Co-salient object detection (CoSOD) methods typically rely on a three-stage pipeline and full fine-tuning, which is parameter-inefficient and limits interaction between encoding and consensus extraction. The paper introduces Visual Consensus Prompting (VCP), a parameter-efficient framework that freezes the foundation model and uses a Consensus Prompt Generator (CPG) and a Consensus Prompt Disperser (CPD) to embed task-specific visual consensus prompts, enabling effective CoSOD with minimal tunable parameters. The approach reports state-of-the-art performance on challenging datasets (e.g., a 6.8% improvement in F_m on CoCA) and includes ablations showing the value of consensus prompts and adaptive prompt dispersion. The work demonstrates that prompt tuning is feasible for CoSOD and offers a scalable alternative to fully fine-tuning large foundation models.

Abstract

Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction and dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) This architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage. 2) This paradigm updates all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose a concise, interaction-effective, and parameter-efficient architecture for the CoSOD task that addresses both limitations. We introduce, for the first time, a parameter-efficient prompt tuning paradigm for this task and embed consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with a minimal number of tunable parameters. Concretely, the Consensus Prompt Generator (CPG) makes a limited set of tunable parameters focus on co-salient representations and generates consensus prompts. The Consensus Prompt Disperser (CPD) uses these consensus prompts to form task-specific visual consensus prompts, thereby unlocking the potential of pre-trained models for CoSOD tasks. Extensive experiments demonstrate that our concise VCP outperforms 13 cutting-edge full fine-tuning models, achieving a new state of the art (with a 6.8% improvement in the F_m metric on the most challenging CoCA dataset). Source code is available at https://github.com/WJ-CV/VCP.
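The contrast between full fine-tuning and prompt tuning can be illustrated with a toy sketch. It is not the authors' implementation: the "backbone" here is a single frozen weight vector, the prompt is one shared vector added to every token, and all names and numbers (`FROZEN_W`, `tokens`, `targets`, the learning-rate rule) are hypothetical choices made for illustration. Only the prompt receives gradient updates.

```python
# Toy prompt tuning: the backbone weights stay frozen, only the prompt is trained.
# All values below are made up for illustration; this is not the VCP model.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

FROZEN_W = (0.5, -1.0, 2.0)  # frozen "backbone" weights (a tuple, so never mutated)
tokens = [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (1.0, 1.0, 0.0)]
targets = [3.5, 2.0, 0.5]

def residuals(prompt):
    # The prompt is added to every input token before the frozen layer.
    out = []
    for x, y in zip(tokens, targets):
        prompted = [xi + pi for xi, pi in zip(x, prompt)]
        out.append(dot(FROZEN_W, prompted) - y)
    return out

def loss(prompt):
    return sum(r * r for r in residuals(prompt))

def grad(prompt):
    # d(loss)/d(prompt_i) = sum over tokens of 2 * residual * w_i
    total_r = sum(residuals(prompt))
    return [2.0 * total_r * wi for wi in FROZEN_W]

prompt = [0.0, 0.0, 0.0]  # the only tunable parameters
initial_loss = loss(prompt)

# Step size chosen conservatively from the curvature of this quadratic loss.
lr = 0.25 / (len(tokens) * dot(FROZEN_W, FROZEN_W))
for _ in range(50):
    g = grad(prompt)
    prompt = [p - lr * gi for p, gi in zip(prompt, g)]

final_loss = loss(prompt)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.2e}")
print(f"tunable parameters: {len(prompt)}, frozen parameters: {len(FROZEN_W)}")
```

In this sketch the loss falls purely through changes to the prompt while the frozen weights are untouched, which is the property the abstract attributes to the prompt tuning paradigm.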

Paper Structure

This paper contains 15 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Existing relevant methods vs. our VCP. (a) Existing CoSOD methods based on typical architectural patterns and full fine-tuning paradigms. (b) Introducing simple tunable parameters as visual prompts to address foreground segmentation tasks in single-scene images. (c) Our proposed VCP and some visualization results. The frozen foundation model is mined to generate task-specific visual consensus prompts (with minimized tunable parameters), thereby inducing it to effectively perform CoSOD.
  • Figure 2: Quantitative comparison of our VCP with 8 representative methods on the CoCA [zhang2020gradient] dataset regarding the $S_m$ and $F_m^{\max}$ metrics and tunable parameters. The bubble area represents the tunable parameters (M). SCED [xu2023co], GEM [wu2023co], MCCL [zheng2023memory], GCoNet+ [zheng2023gconet+], CoPR [zhu2023co], DMT [li2023discriminative], and CADC++ [zhang2023cadc++] are all full fine-tuning CoSOD methods. EVP [liu2023explicit] is based on prompt learning for SOD tasks, and we retrain it using the CoSOD dataset.
  • Figure 3: Overall framework pipeline of our proposed concise and parameter-efficient VCP model. We induce the frozen foundation model to perform better on the CoSOD task by formulating Visual Consensus Prompts with minimal tunable parameters. The proposed Consensus Prompt Generator (CPG) and Consensus Prompt Disperser (CPD) support the implementation of VCP. The CPG mines intra-group co-salient representations of the frozen embeddings to generate consensus prompts $P_{Co}$. The CPD utilizes $P_{Co}$ to form Visual Consensus Prompts and induce the frozen transformer layers to perform the CoSOD task.
  • Figure 4: Overall pipeline of the proposed CPG and CPD. The CPG utilizes predefined saliency seeds to generate saliency estimation maps through clustering, thereby obtaining consensus seeds. By selecting top-k representative consensus seeds, consensus prompts $P_{Co}$ are obtained. The CPD utilizes $P_{Co}$ to generate visual consensus prompts $P_{Visual}^{Co}$ and induce the frozen transformer layers to address the CoSOD task.
  • Figure 5: Visual comparison between our VCP and seven of the most representative methods across four scenarios.
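
The top-k selection step mentioned in the Figure 4 caption can be sketched as follows. The captions do not say how "representative" is scored, so this example assumes one simple rule: rank each candidate seed by its cosine similarity to the group centroid and keep the k highest. The function names (`cosine`, `select_top_k`) and the seed vectors are hypothetical, and the paper's actual criterion may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_top_k(seeds, k):
    """Return indices of the k seeds most similar to the group centroid.

    Assumption: similarity to the centroid stands in for "representativeness";
    this scoring rule is illustrative and not taken from the paper.
    """
    dim = len(seeds[0])
    centroid = [sum(s[i] for s in seeds) / len(seeds) for i in range(dim)]
    scores = [cosine(s, centroid) for s in seeds]
    order = sorted(range(len(seeds)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Hypothetical candidate seeds: three agree on a direction, one is an outlier.
seeds = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],  # outlier, should not be selected
]
top = select_top_k(seeds, k=2)
consensus_prompts = [seeds[i] for i in top]  # toy stand-in for P_Co
print(top, consensus_prompts)
```

With these toy seeds the outlier receives the lowest score and is excluded, so the selected vectors reflect only the direction shared by the group.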