Visual Consensus Prompting for Co-Salient Object Detection
Jie Wang, Nana Yu, Zihao Zhang, Yahong Han
TL;DR
Co-salient object detection (CoSOD) methods typically combine a three-stage pipeline, which limits interaction between encoding and consensus extraction, with full fine-tuning, which is parameter-inefficient. The paper introduces Visual Consensus Prompting (VCP), a parameter-efficient framework that freezes the foundation model and uses a Consensus Prompt Generator (CPG) and a Consensus Prompt Disperser (CPD) to embed task-specific visual consensus prompts, enabling effective CoSOD with few tunable parameters. The approach achieves state-of-the-art performance, including a 6.8% improvement in F_m on the challenging CoCA dataset, and its ablations show the value of consensus prompts and adaptive prompt dispersion. This work demonstrates that prompt tuning is feasible and beneficial for CoSOD, offering a scalable alternative to full fine-tuning of large foundation models.
Abstract
Existing co-salient object detection (CoSOD) methods generally employ a three-stage architecture (i.e., encoding, consensus extraction & dispersion, and prediction) along with a typical full fine-tuning paradigm. Although they yield certain benefits, they exhibit two notable limitations: 1) This architecture relies on encoded features to facilitate consensus extraction, but the meticulously extracted consensus does not provide timely guidance to the encoding stage. 2) This paradigm involves globally updating all parameters of the model, which is parameter-inefficient and hinders the effective representation of knowledge within the foundation model for this task. Therefore, in this paper, we propose a concise, interaction-effective, and parameter-efficient architecture for the CoSOD task that addresses both limitations. For the first time, it introduces a parameter-efficient prompt tuning paradigm to CoSOD and seamlessly embeds consensus into the prompts to formulate task-specific Visual Consensus Prompts (VCP). Our VCP aims to induce the frozen foundation model to perform better on CoSOD tasks by formulating task-specific visual consensus prompts with a minimal number of tunable parameters. Concretely, the primary insight of the Consensus Prompt Generator (CPG) is to force the limited tunable parameters to focus on co-salient representations and generate consensus prompts. The Consensus Prompt Disperser (CPD) then leverages these consensus prompts to form task-specific visual consensus prompts, thereby unlocking the potential of pre-trained models for CoSOD tasks. Extensive experiments demonstrate that our concise VCP outperforms 13 cutting-edge full fine-tuning models, achieving a new state of the art (with a 6.8% improvement in the F_m metric on the most challenging CoCA dataset). Source code is available at https://github.com/WJ-CV/VCP.
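
To make the general idea of prompt tuning with a frozen backbone concrete, the toy sketch below may help. It is not the VCP implementation: the abstract does not specify how the CPG or CPD are built, so every name here (W_frozen, P_down, P_up, generate_consensus_prompt, encode_with_prompt) is a hypothetical stand-in, group averaging is an assumed placeholder for consensus extraction, and the sizes are arbitrary. It only illustrates two points from the abstract: a group-level prompt is fed back into the encoder, and only a small set of parameters is tunable while the backbone stays fixed.

```python
import random

random.seed(0)
DIM, RANK, GROUP = 8, 2, 4  # toy sizes, chosen only for illustration


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols):
    """Random matrix used as a placeholder for learned weights."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Stand-in for one frozen foundation-model layer: never updated.
W_frozen = rand_matrix(DIM, DIM)

# Hypothetical tunable bottleneck (assumption, not the paper's CPG design).
P_down = rand_matrix(RANK, DIM)
P_up = rand_matrix(DIM, RANK)


def encode(x):
    """Frozen encoder: fixed linear map followed by ReLU."""
    return [max(0.0, h) for h in matvec(W_frozen, x)]


def generate_consensus_prompt(group):
    """Assumed consensus step: average the group's features, then project."""
    feats = [encode(x) for x in group]
    mean = [sum(f[i] for f in feats) / len(feats) for i in range(DIM)]
    return matvec(P_up, matvec(P_down, mean))


def encode_with_prompt(x, prompt):
    """Assumed dispersion step: add the shared prompt to each input, re-encode."""
    return encode([a + b for a, b in zip(x, prompt)])


# A "group" of related images, each reduced to a toy feature vector.
group = [[random.uniform(0, 1) for _ in range(DIM)] for _ in range(GROUP)]
prompt = generate_consensus_prompt(group)
prompted = [encode_with_prompt(x, prompt) for x in group]

frozen_params = DIM * DIM
tunable_params = 2 * DIM * RANK
print(f"frozen: {frozen_params}, tunable: {tunable_params}")
```

In a real training loop only P_down and P_up would receive gradient updates. The gap between frozen and tunable parameter counts widens as the backbone grows, because the bottleneck scales with DIM * RANK rather than with the full model size.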
