
IP-SAM: Prompt-Space Conditioning for Prompt-Absent Camouflaged Object Detection

Huiyao Zhang, Jin Bai, Rui Guo, JianWen Tan, HongFei Wang, Ye Li

Abstract

Prompt-conditioned foundation segmenters have emerged as a dominant paradigm for image segmentation, where explicit spatial prompts (e.g., points, boxes, masks) guide mask decoding. However, many real-world deployments require fully automatic segmentation, creating a structural mismatch: the decoder expects prompts that are unavailable at inference. Existing adaptations typically modify intermediate features, inadvertently bypassing the model's native prompt interface and weakening prompt-conditioned decoding. We propose IP-SAM, which revisits adaptation from a prompt-space perspective through prompt-space conditioning. Specifically, a Self-Prompt Generator (SPG) distills image context into complementary intrinsic prompts that serve as coarse regional anchors. These cues are projected through SAM2's frozen prompt encoder, restoring prompt-guided decoding without external intervention. To suppress background-induced false positives, Prompt-Space Gating (PSG) leverages the intrinsic background prompt as an asymmetric suppressive constraint prior to decoding. Under a deterministic no-external-prompt protocol, IP-SAM achieves state-of-the-art performance across four camouflaged object detection benchmarks (e.g., MAE 0.017 on COD10K) with only 21.26M trainable parameters (optimizing SPG, PSG, and a task-specific mask decoder trained from scratch, alongside image-encoder LoRA while keeping the prompt encoder frozen). Furthermore, the proposed conditioning strategy generalizes beyond COD to medical polyp segmentation, where a model trained solely on Kvasir-SEG exhibits strong zero-shot transfer to both CVC-ClinicDB and ETIS.

Paper Structure

This paper contains 19 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overall architecture of IP-SAM. Shifting the paradigm to prompt-space conditioning, IP-SAM utilizes a Self-Prompt Generator (SPG) to synthesize complementary intrinsic prompts ($P^\pm$). These are projected into SAM2's native manifold via the frozen prompt encoder. Prior to decoding, Prompt-Space Gating (PSG) leverages the negative embedding ($Z^-$) to asymmetrically suppress deceptive cues in the positive embedding ($Z^+$). The purified condition ($Z$) explicitly steers the mask decoder. For visual clarity, the optional auxiliary structural regularizer (ablated in the ablation study section) is omitted.
  • Figure 2: Detailed schematic of Prompt-Space Gating (PSG). PSG isolates and eliminates background-induced false positives before they enter the decoder. It formulates a localized gate decision map from the negative prompt embedding ($Z^-$) to physically filter deceptive cues from the positive counterpart ($Z^+$). The suppressed feature ($\widetilde{Z}^+$) is then fused with unactivated background cues and compensated via an anchored residual connection to yield the robust Conditioned Prompt Embedding ($Z$).
  • Figure 3: Qualitative comparison on challenging camouflaged scenarios. Visual predictions of our IP-SAM against representative specialist COD models and SAM-based adaptations under the prompt-absent setting.
  • Figure 4: Internal feature visualization of the Prompt-Space Gating (PSG) mechanism. Feature evolution from the baseline's leakage (Col 2) to IP-SAM's clean prediction (Col 3). Intermediate columns depict suppressed false positives (Col 4), complementary intrinsic prompts (Cols 5-6), gate decision map (Col 7), and feature energy adjustments (Col 8).
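The PSG data flow described in the Figure 2 caption (a gate decision map derived from the negative embedding $Z^-$, asymmetric suppression of the positive embedding $Z^+$, fusion with unactivated background cues, and an anchored residual) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the gate network, fusion layer, and residual weighting here are assumptions, and the actual layer choices and tensor shapes may differ.

```python
import torch
import torch.nn as nn


class PromptSpaceGating(nn.Module):
    """Illustrative sketch of the PSG mechanism (Figure 2).

    Assumptions (not from the paper): the gate is a sigmoid-activated
    linear map over the negative embedding, and fusion is a single
    linear layer over concatenated features.
    """

    def __init__(self, dim: int):
        super().__init__()
        # Gate decision map computed from the negative prompt embedding Z^-
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        # Fusion of the suppressed positive feature with background cues
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, z_pos: torch.Tensor, z_neg: torch.Tensor) -> torch.Tensor:
        g = self.gate(z_neg)                  # localized gate decision map
        z_pos_tilde = z_pos * (1.0 - g)       # asymmetric suppression of deceptive cues
        bg_unactivated = z_neg * (1.0 - g)    # background cues not flagged by the gate
        fused = self.fuse(torch.cat([z_pos_tilde, bg_unactivated], dim=-1))
        return fused + z_pos                  # anchored residual -> conditioned embedding Z


# Usage: batch of 2, 4 prompt tokens, embedding dim 8
psg = PromptSpaceGating(dim=8)
z = psg(torch.randn(2, 4, 8), torch.randn(2, 4, 8))
```

Note the asymmetry: the gate is computed only from $Z^-$ and applied only as suppression on $Z^+$, matching the caption's description of a suppressive (rather than symmetric, mutually attentive) constraint applied before decoding.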