Improving SAM for Camouflaged Object Detection via Dual Stream Adapters

Jiaming Liu, Linghe Kong, Guihai Chen

TL;DR

The paper addresses camouflaged object detection by extending Segment Anything Model (SAM) with dual-stream adapters for RGB-D inputs, enabling complementary semantic and structural cues to guide segmentation. It introduces bidirectional knowledge distillation and mixed prompt embedding to harmonize RGB and depth representations and prompts while keeping the SAM backbone largely intact. The approach achieves state-of-the-art or competitive results on four COD benchmarks, demonstrating strong gains in both RGB-only and RGB-D settings and showing the value of task-specific adapters within a visual foundation model. This has practical implications for robust COD in challenging environments and suggests adaptable pathways for multimodal extensions of foundation models.

Abstract

The Segment Anything Model (SAM) has shown impressive general-purpose segmentation performance on natural images, but its performance on camouflaged object detection (COD) is unsatisfactory. In this paper, we propose SAM-COD, which performs camouflaged object detection for RGB-D inputs. While keeping the SAM architecture intact, we add dual-stream adapters to the image encoder to learn potentially complementary information from RGB and depth images, and we fine-tune the mask decoder and its depth replica to perform dual-stream mask prediction. In practice, the dual-stream adapters are embedded into the attention block of the image encoder in parallel to facilitate the refinement and correction of the two types of image embeddings. To mitigate channel discrepancies arising from dual-stream embeddings that do not directly interact with each other, we strengthen the association between them using bidirectional knowledge distillation, which comprises a model distiller and a modal distiller. In addition, to predict the masks for the RGB and depth attention maps, we hybridize the two types of image embeddings, which are jointly learned with the prompt embeddings to update the initial prompt, and then feed them into the mask decoders so that image embeddings and prompt embeddings stay consistent. Experimental results on four COD benchmarks show that SAM-COD achieves substantial detection performance gains over SAM and state-of-the-art results under a given fine-tuning paradigm.
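The two distillation terms named in the abstract can be illustrated with a minimal sketch. It assumes a mean-squared-error feature loss and flat feature vectors; neither is stated in this summary, and the function names `mse` and `distillation_losses` are hypothetical. The teacher-to-student directions follow the Figure 4 caption: pre-trained encoder to RGB adapter, then RGB adapter to depth adapter.

```python
def mse(student, teacher):
    """Mean squared error between two equally sized flat feature vectors."""
    if len(student) != len(teacher):
        raise ValueError("feature vectors must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def distillation_losses(frozen_feat, rgb_feat, depth_feat):
    """Return (model_loss, modal_loss) for one sample.

    model_loss: frozen pre-trained encoder (teacher) -> RGB adapter stream.
    modal_loss: RGB adapter stream (teacher) -> depth adapter stream.
    MSE is an assumed choice of loss, not taken from the paper.
    In an autograd framework each teacher would be detached so that
    gradients reach only the student stream.
    """
    model_loss = mse(rgb_feat, frozen_feat)
    modal_loss = mse(depth_feat, rgb_feat)
    return model_loss, modal_loss


# Toy features standing in for image embeddings.
frozen = [1.0, 0.0, 2.0, 1.0]
rgb = [1.0, 1.0, 2.0, 1.0]
depth = [0.0, 1.0, 2.0, 3.0]
model_loss, modal_loss = distillation_losses(frozen, rgb, depth)
print(model_loss, modal_loss)
```

Keeping the two terms separate makes it easy to weight them differently or to apply them in sequence during training.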

Paper Structure

This paper contains 13 sections, 7 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Illustration of the proposed SAM-COD. The two learnable adapters based on the SAM model are extended to act on paired input streams of RGB images and depth images, respectively. Subject to co-supervision, $Ada_{RGB}$ focuses on perceiving pixel semantics to segment objects and $Ada_{Depth}$ focuses on structural transformation to separate objects from the background.
  • Figure 2: Overall pipeline of our SAM-COD. The dual-stream images are fed into SAM in parallel to extract image features that are fine-tuned by the respective adapters. The knowledge distiller addresses the discrepancies that arise because the dual-stream features are decoded without direct interaction. The initialized box prompt is mixed with the image features to generate a more refined dense prompt embedding. Finally, the predictions from the two types of feature maps are combined by a weighted sum to obtain the final detection result.
  • Figure 3: Illustration of the proposed adapter. The patch embedding of the image is used as input, and our dual-stream adapter extracts an image embedding with high-frequency details via a feature projection transform and a discrete wavelet transform.
  • Figure 4: Illustration of the proposed bidirectional knowledge distillation. Model distillation from the pre-trained image encoder to the fine-tuned RGB adapter and modal distillation from the RGB adapter to the depth adapter are executed sequentially.
  • Figure 5: Comparison of our SAM-COD and other methods on the COD task. We focus mainly on SAM-based methods.
  • ...and 2 more figures
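
The Figure 3 caption mentions a discrete wavelet transform for recovering high-frequency details. As a rough illustration of what such a transform produces, here is a one-level 2D Haar decomposition in plain Python. The choice of the Haar wavelet, the function name `haar_dwt2`, and the sub-band names are assumptions made for this sketch; the summary does not say which wavelet the adapter uses.

```python
def haar_dwt2(img):
    """One-level orthonormal 2D Haar transform of a 2D list.

    The image must have even height and width. Returns four half-size
    sub-bands: the low-frequency approximation and three high-frequency
    detail bands (column, row, and diagonal differences).
    """
    height, width = len(img), len(img[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    approx, col_diff, row_diff, diag_diff = [], [], [], []
    for i in range(0, height, 2):
        a_row, c_row, r_row, d_row = [], [], [], []
        for j in range(0, width, 2):
            # Pixels of the current 2x2 block.
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            a_row.append((a + b + c + d) / 2)  # local average (low frequency)
            c_row.append((a - b + c - d) / 2)  # left column minus right column
            r_row.append((a + b - c - d) / 2)  # top row minus bottom row
            d_row.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(a_row)
        col_diff.append(c_row)
        row_diff.append(r_row)
        diag_diff.append(d_row)
    return approx, col_diff, row_diff, diag_diff


# Flat top half, horizontal edge between the two bottom rows.
image = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [2, 2, 2, 2],
    [4, 4, 4, 4],
]
approx, col_diff, row_diff, diag_diff = haar_dwt2(image)
print(row_diff)
```

In this toy image only `row_diff` is non-zero, and only where the bottom two rows differ, which is the sense in which the detail bands isolate edges and other high-frequency structure.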