Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations

Cheng Lei, Jie Fan, Xinran Li, Tianzhu Xiang, Ao Li, Ce Zhu, Le Zhang

TL;DR

This work examines the attention patterns learned for camouflaged objects and introduces a robust zero-shot COS framework that adapts to the inherent local pattern bias of COS while incorporating global attention patterns and a broad semantic feature space derived from SOS, enabling efficient zero-shot transfer for COS.

Abstract

Camouflaged Object Segmentation (COS) faces significant challenges due to the scarcity of annotated data: meticulous pixel-level annotation is both labor-intensive and costly, primarily because of the intricate object-background boundaries. To the core question, "Can COS be effectively achieved in a zero-shot manner without manual annotations for any camouflaged object?", we answer in the affirmative and introduce a robust zero-shot COS framework. This framework leverages the inherent local pattern bias of COS and employs a broad semantic feature space derived from salient object segmentation (SOS) for efficient zero-shot transfer. We incorporate a Masked Image Modeling (MIM) based image encoder optimized for Parameter-Efficient Fine-Tuning (PEFT), a Multimodal Large Language Model (M-LLM), and a Multi-scale Fine-grained Alignment (MFA) mechanism. The MIM pre-trained image encoder focuses on capturing essential low-level features, while the M-LLM generates caption embeddings that are processed alongside these visual cues. These embeddings are precisely aligned using MFA, enabling our framework to accurately interpret and navigate complex semantic contexts. To optimize operational efficiency, we introduce a learnable codebook that represents the M-LLM during inference, significantly reducing computational overhead. Our framework demonstrates its versatility and efficacy through rigorous experimentation, achieving state-of-the-art performance in zero-shot COS with $F_\beta^w$ scores of 72.9% on CAMO and 71.7% on COD10K. By removing the M-LLM during inference, we achieve an inference speed comparable to that of traditional end-to-end models, reaching 18.1 FPS. Code: https://github.com/R-LEI360725/ZSCOS-CaMF
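
The split between training and inference described in the abstract can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: the class name `TextConditioner`, the `mllm` callable, and the toy vectors are all hypothetical, and how the learned query (the "learnable codebook") is actually trained is not shown.

```python
class TextConditioner:
    """Selects the text-side tokens handed to the alignment stage (hypothetical helper)."""

    def __init__(self, learned_query):
        # learned_query: list of token vectors, assumed to have been trained already.
        self.learned_query = learned_query

    def tokens(self, image, mllm=None):
        # Training path: a stand-in M-LLM callable produces caption embeddings.
        if mllm is not None:
            return mllm(image)
        # Inference path: the M-LLM is never called; the learned query stands in.
        return self.learned_query


# Toy usage with made-up vectors; a real model would use high-dimensional embeddings.
conditioner = TextConditioner(learned_query=[[0.1, 0.2], [0.3, 0.4]])
fake_mllm = lambda image: [[1.0, 0.0]]  # placeholder for caption embeddings
train_tokens = conditioner.tokens("image", mllm=fake_mllm)
infer_tokens = conditioner.tokens("image")
```

Because the inference path never touches the M-LLM, its cost drops out of the forward pass, which is the mechanism the abstract credits for the reported speed.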

Paper Structure

This paper contains 22 sections, 14 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Examples of salient and camouflaged objects. In salient object segmentation (SOS), objects are easily distinguishable from the background, while in camouflaged object segmentation (COS), objects blend into their surroundings. The attention maps reveal that SOS focuses on semantic information, whereas COS emphasizes edge features to detect camouflaged objects.
  • Figure 2: Overview of different learning pipelines for COS. In Supervised COS (a), camouflaged objects are used for both training and testing (SINet, SINetV2, ZoomNet, FSPNet). In Zero-Shot COD (ZSCOD) and Open-Vocabulary COS (OVCOS) (b), the training data consists of a specific set of camouflaged objects, while the testing data includes unseen camouflaged objects. Our Zero-shot COS (c) setting offers a more practical approach, as it is trained without any camouflaged annotations. Instead, we use a saliency dataset, which is more readily available and more cost-effective.
  • Figure 3: Overall architecture of the proposed framework. In (a), the PEFT module is employed during the training phase. Specifically, only the PEFT module, the MFA, the query, and the simple mask decoder are fine-tuned, while the remaining parts of the architecture are kept frozen. During inference, the M-LLM is removed and the learned query replaces the caption embeddings. (b) illustrates the implementation of PEFT using an Adapter (AdaptFormer) on the transformer block (EVA02) within the image encoder. The bias terms are omitted in the figure.
  • Figure 4: Multi-scale Token Match Operation and the structure of Text Mixer. The left figure highlights how the multi-scale vision tokens and caption tokens are aligned using TM. The right figure provides a closer look at the text mixer, which is composed of two mixer blocks and a down-scaling module.
  • Figure 5: The structure of the simple mask decoder. We use several MLPs and upsampling modules to maintain a simple design.
  • ...and 3 more figures
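
To make the adapter-based PEFT mentioned in the Figure 3 caption more concrete, here is a minimal sketch of a bottleneck adapter running in parallel with a frozen MLP. It follows the general adapter pattern (down-projection, ReLU, up-projection, scaling) and omits bias terms, as the caption notes. The dimensions, the scale value, the zero-initialized up-projection, and all names are assumptions made for illustration; the exact placement inside the EVA02 block is not taken from the paper.

```python
import random


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


class BottleneckAdapter:
    """Trainable low-rank branch: scale * Up(ReLU(Down(x))), bias terms omitted."""

    def __init__(self, dim, bottleneck, scale=0.1, seed=0):
        rng = random.Random(seed)
        # Down-projection gets a small random init (assumed).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection starts at zero so the branch initially contributes nothing (assumed).
        self.up = [[0.0] * bottleneck for _ in range(dim)]
        self.scale = scale

    def __call__(self, token):
        hidden = [max(0.0, h) for h in matvec(self.down, token)]  # ReLU
        return [self.scale * u for u in matvec(self.up, hidden)]


def adapted_block_output(frozen_mlp, adapter, token):
    """Residual connection + frozen MLP output + adapter output."""
    mlp_out = frozen_mlp(token)   # frozen pre-trained weights, not updated
    delta = adapter(token)        # only these parameters would be trained
    return [t + m + d for t, m, d in zip(token, mlp_out, delta)]


# Toy usage: a stand-in "frozen MLP" that simply doubles its input.
adapter = BottleneckAdapter(dim=4, bottleneck=2)
token = [1.0, -2.0, 0.5, 3.0]
frozen = lambda t: [2.0 * x for x in t]
out = adapted_block_output(frozen, adapter, token)
```

With the up-projection at zero, the adapted block reproduces the frozen block exactly at the start of fine-tuning, so training begins from the pre-trained behaviour and only the small adapter weights need gradients.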