CamSAM2: Segment Anything Accurately in Camouflaged Videos

Yuli Zhou, Yawei Li, Yuqian Fu, Luca Benini, Ender Konukoglu, Guolei Sun

TL;DR

CamSAM2 tackles video camouflaged object segmentation (VCOS) by augmenting a frozen SAM2 with a learnable decamouflaged token, two fusion modules, and an object prototype memory. The implicit object-aware fusion (IOF) module enriches features with high-resolution details from the current frame, while the explicit object-aware fusion (EOF) module attends to object prototypes that the object prototype generation (OPG) module builds from previous frames. The approach yields large improvements over SAM2 on MoCA-Mask, CAD, and SUN-SEG, for example a 12.2 mDice gain with a click prompt on MoCA-Mask using the Hiera-T backbone, with strong zero-shot performance and only modest compute overhead. Because SAM2’s parameters stay fixed, CamSAM2 retains SAM2’s generalization while achieving state-of-the-art VCOS performance on public benchmarks, even with simple prompts such as a single click.

Abstract

Video camouflaged object segmentation (VCOS), which aims to segment camouflaged objects that seamlessly blend into their environment, is a fundamental vision task with various real-world applications. With the release of SAM2, video segmentation has witnessed significant progress. However, SAM2's ability to segment camouflaged videos is suboptimal, especially when given simple prompts such as points and boxes. To address this problem, we propose Camouflaged SAM2 (CamSAM2), which enhances SAM2's ability to handle camouflaged scenes without modifying SAM2's parameters. Specifically, we introduce a decamouflaged token to provide the flexibility of feature adjustment for VCOS. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules, respectively. Object prototype generation (OPG) is introduced to abstract and memorize object prototypes with informative details using high-quality features from previous frames. Extensive experiments are conducted to validate the effectiveness of our approach. While CamSAM2 adds only a negligible number of learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets, achieving in particular gains of 12.2 mDice with a click prompt on MoCA-Mask and 19.6 mDice with a mask prompt on SUN-SEG-Hard, with Hiera-T as the backbone. The code is available at https://github.com/zhoustan/CamSAM2.
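
The gains above are reported in mDice. As a reading aid, the following is a minimal sketch of a frame-averaged Dice score for binary masks. It is an illustration under stated assumptions, not the benchmark's evaluation code: the function names `dice` and `mean_dice` are hypothetical, masks are flattened to 0/1 sequences, the empty-mask convention is an assumption, and the official protocols may average differently (for example, over binarization thresholds as well as frames).

```python
def dice(pred, gt):
    """Dice coefficient between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(gt):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    if total == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / total


def mean_dice(preds, gts):
    """Average the per-frame Dice scores over a clip (frames only; assumption)."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("need the same non-zero number of predictions and targets")
    scores = [dice(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


# Two toy 4-pixel frames: the first over-segments by one pixel, the second is exact.
clip_preds = [[1, 1, 0, 0], [0, 1, 1, 0]]
clip_gts = [[1, 0, 0, 0], [0, 1, 1, 0]]
clip_score = mean_dice(clip_preds, clip_gts)
```

Under this definition, a "12.2 mDice gain" corresponds to the score rising by 0.122 on a 0-to-1 scale, or 12.2 points when reported as a percentage.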

Paper Structure

This paper contains 43 sections, 7 equations, 9 figures, 16 tables.

Figures (9)

  • Figure 1: Illustration of SAM2 and CamSAM2. Top: SAM2's segmentation of the camouflaged object is suboptimal, primarily because its feature optimization is biased toward natural videos, and its design does not account for the unique challenges inherent to VCOS. Bottom: CamSAM2 improves SAM2's ability to segment and track camouflaged objects by introducing a decamouflaged token, IOF to enhance features with high-resolution features, and EOF and OPG to further enhance features by exploiting informative object details across time. CamSAM2 only adds a limited number of parameters to SAM2 while keeping all SAM2's parameters fixed and fully inheriting SAM2's zero-shot ability. The segmentation result is overlaid in orange on the frame.
  • Figure 2: Overall architecture of CamSAM2. CamSAM2 effectively captures and segments camouflaged objects by leveraging implicit and explicit object-aware information from the current or previous frames. It includes the following key components: (a) the decamouflaged token, which extends SAM2's token structure to learn features suitable for camouflaged objects; (b) an IOF module to enrich memory-conditioned features with implicitly object-aware high-resolution features; (c) an EOF module to aggregate explicit object-aware features; and (d) an OPG module, generating informative object prototypes, which guides cross-attention in EOF. These components work together to preserve fine details, enhance segmentation quality, and track camouflaged objects across time.
  • Figure 3: Qualitative comparisons between SAM2 and CamSAM2 using 1-click prompt with the Hiera-T backbone on two MoCA-Mask clips. From top to bottom: the input frames, SAM2's results, CamSAM2's results, and ground-truth masks. CamSAM2 demonstrates improved accuracy in VCOS, especially in complex backgrounds, as shown by the circles. Best viewed in color.
  • Figure 4: Attention map visualization from SAM2 and CamSAM2 using point prompts with the Hiera-T backbone. From top to bottom: input frames, attention with SAM2 mask token, attention with decamouflaged token, and ground-truth masks. The higher attention regions are indicated by warmer colors.
  • Figure 5: Illustration of the architecture toggle. The toggle switch enables or disables the modules proposed for VCOS: the decamouflaged token, IOF, EOF, and OPG. Modules and flows drawn in dashed lines are in the disabled state.
  • ...and 4 more figures
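
The Figure 2 caption describes OPG producing object prototypes that guide cross-attention in EOF. The sketch below illustrates that general idea only and should not be read as the authors' implementation. It assumes masked average pooling as a stand-in for prototype generation and a single-head, projection-free residual cross-attention as a stand-in for EOF; the function names `masked_average` and `attend_to_prototypes` are hypothetical, and features are plain lists of floats rather than tensors.

```python
import math


def masked_average(features, mask):
    """Pool the feature vectors at positions where the mask is 1 into one prototype.

    Assumption: a simple stand-in for prototype generation, not the paper's OPG.
    """
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no positions")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_prototypes(features, prototypes):
    """Residual cross-attention: each position queries the stored prototypes.

    Assumption: no learned query/key/value projections, single head.
    """
    dim = len(prototypes[0])
    scale = 1.0 / math.sqrt(dim)
    refined = []
    for f in features:
        scores = [scale * sum(a * b for a, b in zip(f, p)) for p in prototypes]
        weights = softmax(scores)
        context = [sum(w * p[d] for w, p in zip(weights, prototypes)) for d in range(dim)]
        refined.append([a + c for a, c in zip(f, context)])
    return refined


# Previous frame: three positions with 2-D features; the mask marks the object.
prev_features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
prev_mask = [1, 1, 0]
memory = [masked_average(prev_features, prev_mask)]  # one stored prototype

# Current frame: one position is refined using the stored prototype.
refined = attend_to_prototypes([[0.5, 0.5]], memory)
```

With a single stored prototype the attention weight is 1, so the refined feature is simply the input plus the prototype. With several prototypes, each position would draw mostly on the prototypes most similar to its own feature.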