Open-Vocabulary Camouflaged Object Segmentation with Cascaded Vision Language Models

Kai Zhao, Wubang Yuan, Zheng Wang, Guanyi Li, Xiaoqiang Zhu, Deng-ping Fan, Dan Zeng

TL;DR

This paper introduces a novel VLM-guided cascaded framework that treats the segmentation output as a soft spatial prior via the alpha channel. This retains the full image context while providing precise spatial guidance, leading to more accurate and context-aware classification of camouflaged objects.

Abstract

Open-Vocabulary Camouflaged Object Segmentation (OVCOS) seeks to segment and classify camouflaged objects from arbitrary categories, presenting unique challenges due to visual ambiguity and unseen categories. Recent approaches typically adopt a two-stage paradigm: first segmenting objects, then classifying the segmented regions using Vision Language Models (VLMs). However, these methods (1) suffer from a domain gap caused by the mismatch between VLMs' full-image training and cropped-region inference, and (2) depend on generic segmentation models optimized for well-delineated objects, making them less effective for camouflaged objects. Without explicit guidance, generic segmentation models often overlook subtle boundaries, leading to imprecise segmentation. In this paper, we introduce a novel VLM-guided cascaded framework to address these issues in OVCOS. For segmentation, we leverage the Segment Anything Model (SAM), guided by the VLM. Our framework uses VLM-derived features as explicit prompts to SAM, effectively directing attention to camouflaged regions and significantly improving localization accuracy. For classification, we avoid the domain gap introduced by hard cropping. Instead, we treat the segmentation output as a soft spatial prior via the alpha channel, which retains the full image context while providing precise spatial guidance, leading to more accurate and context-aware classification of camouflaged objects. The same VLM is shared across both segmentation and classification to ensure efficiency and semantic consistency. Extensive experiments on both OVCOS and conventional camouflaged object segmentation benchmarks demonstrate the clear superiority of our method, highlighting the effectiveness of leveraging rich VLM semantics for both segmentation and classification of camouflaged objects.
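
The contrast between hard cropping and a soft alpha-channel prior can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `hard_crop` and `soft_alpha_input`, the 0.5 threshold, and the toy 8x8 image are assumptions made for illustration, and how the four-channel input is consumed by the CLIP image encoder is not specified in this summary.

```python
import numpy as np

def hard_crop(image, mask, threshold=0.5):
    """Baseline (assumed form): zero out background, then crop to the mask's bounding box."""
    binary = mask >= threshold
    if not binary.any():
        return np.zeros((0, 0, image.shape[2]), dtype=image.dtype)
    rows = np.flatnonzero(binary.any(axis=1))
    cols = np.flatnonzero(binary.any(axis=0))
    r0, r1 = rows[0], rows[-1] + 1
    c0, c1 = cols[0], cols[-1] + 1
    masked = image * binary[..., None]   # all context outside the mask is discarded
    return masked[r0:r1, c0:c1]

def soft_alpha_input(image, mask):
    """Keep every pixel and append the soft mask as a fourth (alpha) channel."""
    alpha = np.clip(mask, 0.0, 1.0)[..., None].astype(image.dtype)
    return np.concatenate([image, alpha], axis=-1)

# Toy data: a random RGB image and a soft mask with a confident core
# and a low-confidence boundary row.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3), dtype=np.float32)
mask = np.zeros((8, 8), dtype=np.float32)
mask[2:5, 3:7] = 0.9   # confident object region
mask[5, 3:7] = 0.3     # uncertain boundary, lost by thresholding

cropped = hard_crop(image, mask)
rgba = soft_alpha_input(image, mask)
print(cropped.shape)  # (3, 4, 3)
print(rgba.shape)     # (8, 8, 4)
```

In this sketch the cropped input loses both the surrounding scene and the low-confidence boundary row, whereas the four-channel input keeps the full image and passes the mask along as graded spatial guidance.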

Paper Structure

This paper contains 25 sections, 9 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Different Camouflaged Object Segmentation (COS) paradigms in two-stage OVCOS. (a) Generic segmentation models, such as MaskFormer, typically operate directly on the input image without target-specific guidance, and are primarily designed to segment salient foreground objects. (b) Our segmentation model leverages vision-language embeddings from CLIP as prompts to guide the SAM model, directing attention to the camouflaged area.
  • Figure 2: Comparison of mask-guided classification strategies. (a) Mask cropping strategy: applies the segmentation mask to crop the input image before feeding it into the CLIP image encoder. (b) Ours: fuses the segmentation mask with the original image for region-aware classification while retaining full-image context.
  • Figure 3: Overview of the cascaded segment and classify framework. In Stage-1, the adapted SAM model generates a class-agnostic camouflaged segmentation mask using textual and visual embeddings as prompts. In Stage-2, we use the generated segmentation mask to enable region-aware open-vocabulary classification.
  • Figure 4: The CLIP fine-tuning pipeline. The language branch encodes base class labels $C_{seen}$ with a camouflage-specific prompt template and learnable textual prompts $P_t$ to obtain textual embeddings $E_t^N$. The vision branch fuses features from the input image and alpha mask, combined with visual prompts $P_v$ injected via an MLP, and passes them to the frozen CLIP image encoder to obtain visual embedding $E_v$. Similarity scores $S$ are computed by aligning $E_t^N$ and $E_v$ in a shared space.
  • Figure 5: Overview of the adapted SAM framework. (a) Adapted SAM for COS: Our fine-tuned CLIP provides textual embeddings $E_t^N$, visual embedding $E_v$, and similarity scores $S$, which are projected into condition prompts $P_c$ via a Prompt Adapter. Image features $X$ extracted by SAM ViT encoder are refined by adapters. The Mask Decoder integrates $X$ and $P_c$ to predict the segmentation mask $M$ and edge map $E$, enabling precise localization. (b) Prompt Adapter: Selects the most relevant textual embedding based on $S$, and projects both $E_t$ and $E_v$ into a unified condition space via lightweight MLPs to guide the decoder. (c) Adapted Mask Decoder: Combines image features $X$, condition prompts $P_c$, and output tokens $T_{\text{tokens}}$ to produce accurate masks and edge maps, improving segmentation in camouflaged scenes.
  • ...and 1 more figure
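
The Prompt Adapter described in the Figure 4 and Figure 5 captions (score the textual embeddings $E_t^N$ against the visual embedding $E_v$, pick the best-matching textual embedding via $S$, and project both into condition prompts $P_c$) might be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: the softmax over temperature-scaled cosine similarities, the temperature value, the two-layer ReLU MLPs, the embedding sizes, and the choice to stack the two projections as two prompt tokens are all hypothetical, as are the helper names.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def similarity_scores(text_emb, vis_emb, temperature=0.01):
    """S: softmax over cosine similarities between N text embeddings and one visual embedding."""
    logits = l2_normalize(text_emb) @ l2_normalize(vis_emb) / temperature
    logits -= logits.max()               # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def mlp(x, w1, b1, w2, b2):
    """Lightweight two-layer MLP with ReLU (assumed architecture)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def make_mlp_params(rng, d_in, d_hidden, d_out):
    return (0.1 * rng.standard_normal((d_in, d_hidden)), np.zeros(d_hidden),
            0.1 * rng.standard_normal((d_hidden, d_out)), np.zeros(d_out))

def prompt_adapter(text_emb, vis_emb, scores, text_params, vis_params):
    """Select the top-scoring text embedding and project it and E_v into a shared condition space."""
    best = int(np.argmax(scores))
    p_text = mlp(text_emb[best], *text_params)
    p_vis = mlp(vis_emb, *vis_params)
    return np.stack([p_text, p_vis]), best   # P_c as two prompt tokens (assumption)

# Toy setup: N = 3 class embeddings of size 8, condition space of size 4.
rng = np.random.default_rng(0)
E_t = rng.standard_normal((3, 8))
E_v = E_t[1] + 0.05 * rng.standard_normal(8)   # visual embedding built to resemble class 1

S = similarity_scores(E_t, E_v)
P_c, best = prompt_adapter(E_t, E_v, S,
                           make_mlp_params(rng, 8, 16, 4),
                           make_mlp_params(rng, 8, 16, 4))
print(best)        # 1
print(P_c.shape)   # (2, 4)
```

In the framework described by the captions, $P_c$ would then be passed to the mask decoder together with the image features $X$; that decoder is not modeled here.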