VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning

Ziyang Luo, Nian Liu, Wangbo Zhao, Xuguang Yang, Dingwen Zhang, Deng-Ping Fan, Fahad Khan, Junwei Han

TL;DR

VSCode presents a generalist model for multimodal salient and camouflaged object detection by combining a foundation segmentation model with 2D prompts that separately encode domain and task peculiarities. The approach leverages a Swin-based VST backbone, domain-specific prompts inserted in the encoder, and task-specific prompts in both the encoder and decoder, augmented by a prompt discrimination loss to disentangle knowledge and improve generalization. Trained jointly on four SOD tasks and three COD tasks, it achieves state-of-the-art results across 26 datasets and demonstrates zero-shot generalization to unseen tasks by mixing prompts (e.g., RGB-D COD). This work highlights the efficiency and scalability of prompt-based generalist models for complex multimodal segmentation, with practical implications for reducing task-specific model proliferation. The availability of source code further enables adoption and extension to new multimodal detection scenarios.

Abstract

Salient object detection (SOD) and camouflaged object detection (COD) are related yet distinct binary mapping tasks. These tasks involve multiple modalities, sharing commonalities and unique cues. Existing research often employs intricate task-specific specialist models, potentially leading to redundancy and suboptimal results. We introduce VSCode, a generalist model with novel 2D prompt learning, to jointly address four SOD tasks and three COD tasks. We utilize VST as the foundation model and introduce 2D prompts within the encoder-decoder architecture to learn domain and task-specific knowledge on two separate dimensions. A prompt discrimination loss helps disentangle peculiarities to benefit model optimization. VSCode outperforms state-of-the-art methods across six tasks on 26 datasets and exhibits zero-shot generalization to unseen tasks, such as RGB-D COD, by combining 2D prompts. Source code is available at https://github.com/Sssssuperior/VSCode.
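
The idea of treating each task as a point on a domain-by-task grid can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation: the names `make_prompt_bank` and `compose_prompts`, the token counts, the embedding width, and the placeholder values are all assumptions made for this example. It shows how an unseen combination such as RGB-D COD could be addressed by picking the RGB and Depth domain prompts together with the COD task prompt.

```python
from itertools import product

# The two prompt dimensions named in the paper's Figure 1.
DOMAINS = ("RGB", "Depth", "Thermal", "Flow")
TASKS = ("SOD", "COD")


def make_prompt_bank(names, num_tokens, dim):
    """Build one prompt (a list of token vectors) per name.

    Hypothetical helper: real prompts would be learned parameters;
    here each prompt is filled with a constant placeholder value.
    """
    return {
        name: [[float(idx)] * dim for _ in range(num_tokens)]
        for idx, name in enumerate(names)
    }


def compose_prompts(domain_bank, task_bank, domains, task):
    """Select prompts along both dimensions for one task instance.

    `domains` may list several modalities (e.g. RGB and Depth);
    their prompt tokens are simply concatenated in this sketch.
    """
    domain_tokens = [tok for d in domains for tok in domain_bank[d]]
    return {"domain": domain_tokens, "task": list(task_bank[task])}


# Assumed sizes, chosen only to keep the example small.
domain_bank = make_prompt_bank(DOMAINS, num_tokens=2, dim=4)
task_bank = make_prompt_bank(TASKS, num_tokens=3, dim=4)

# RGB-D COD: combine two domain prompts with the COD task prompt.
rgbd_cod = compose_prompts(domain_bank, task_bank, ("RGB", "Depth"), "COD")

# Every (domain, task) pair on the 2D grid.
grid = list(product(DOMAINS, TASKS))
```

Storing four domain prompts and two task prompts covers all eight grid cells, which is the storage argument for 2D prompts over one prompt (or one model) per task.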

Paper Structure (21 sections, 4 equations, 5 figures, 9 tables)

Figures (5)

  • Figure 1: Relationship of SOD, COD, and multimodal tasks. Each specific task is seen as a combination of two dimensions, i.e. domain (RGB/Depth/Thermal/Flow) and task (SOD/COD).
  • Figure 2: Overall architecture of our VSCode model. We use VST (Liu et al., ICCV 2021) as the foundation model to acquire commonalities among multimodal SOD and COD tasks. For each task, we integrate 2D prompts to aggregate peculiarities along the domain dimension and the task dimension, including four domain-specific prompts and two task-specific prompts.
  • Figure 3: Overall framework of our proposed VSCode model with 2D prompt learning. Based on the VST (Liu et al., ICCV 2021) foundation model, we insert the respective domain-specific prompts and task-specific prompts in the attention windows of the Swin transformer (Liu et al., 2021) encoder layers to learn domain and task-specific encoder features. The convertor is used for multimodal feature fusion. Within the transformer decoder layers, task-specific prompts are appended to image feature tokens to perform task-specific decoding. We also provide detailed structures of an encoder layer ($i=0$) and a decoder layer ($j=0$).
  • Figure 4: Illustration of the influence of using different task prompts.
  • Figure 5: Correlation of prompt pairs at each encoder block.
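
The two prompt placements described in the Figure 3 caption (inside encoder attention windows, and appended to decoder image tokens) can be illustrated with a rough sketch. This is an assumption-laden toy, not the paper's code: tokens are plain Python lists, windows are 1D slices rather than the 2D windows a Swin transformer uses, and the function names `append_prompts_to_windows` and `append_prompts_to_sequence` are hypothetical.

```python
def append_prompts_to_windows(tokens, window_size, prompts):
    """Encoder-style placement: add the prompt tokens to every window.

    Simplification: windows are contiguous 1D chunks of the token list.
    Attention would then run over each extended window.
    """
    if len(tokens) % window_size != 0:
        raise ValueError("token count must be divisible by window_size")
    windows = [
        tokens[i:i + window_size]
        for i in range(0, len(tokens), window_size)
    ]
    return [window + prompts for window in windows]


def append_prompts_to_sequence(tokens, prompts):
    """Decoder-style placement: append prompts to the full token sequence."""
    return tokens + prompts


# Eight placeholder image tokens and two placeholder prompt tokens.
image_tokens = [[float(i)] * 4 for i in range(8)]
prompt_tokens = [[-1.0] * 4, [-2.0] * 4]

encoder_windows = append_prompts_to_windows(image_tokens, 4, prompt_tokens)
decoder_sequence = append_prompts_to_sequence(image_tokens, prompt_tokens)
```

In this sketch each window sees the same prompt tokens, so the prompt cost grows with the number of windows in the encoder but stays constant in the decoder.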