An Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation
Chao Yin, Jide Li, Hang Yao, Xiaoqiang Li
TL;DR
The paper tackles training-free Camouflaged Object Segmentation (COS), where existing pipelines generate semantic-level prompts that steer SAM toward coarse masks and fail to separate multiple camouflaged instances. It introduces IAPF, which converts a single task-generic prompt into instance-level visual prompts. IAPF does this with a detector-agnostic box enumerator, Single-Foreground Multi-Background Prompting (SFMBP) guided by Spatial CLIP heatmaps, and a self-consistency voting mechanism, all built from frozen components. The approach yields accurate, instance-discriminative masks for scenes with multiple camouflaged objects and achieves state-of-the-art results among training-free methods. This includes strong performance on Camouflaged Instance Segmentation (CIS) benchmarks and downstream RGB-only tasks. The framework offers a practical zero-shot solution for promptable segmentation in complex scenes, with robust cross-domain generalization.
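The SFMBP idea can be illustrated with a minimal NumPy sketch. It assumes a per-pixel foreground heatmap (a stand-in for the Spatial CLIP heatmap) and a single box prompt. The sketch takes the highest-scoring pixel inside the box as the single foreground point and the `num_bg` lowest-scoring pixels inside the same box as background points. The function name `sfmbp_points`, the lowest-score rule for background points, and the `num_bg` parameter are hypothetical; the paper's actual sampling rule may differ. The output follows SAM's point-prompt convention: (x, y) coordinates, with label 1 for foreground and 0 for background.

```python
import numpy as np

def sfmbp_points(heatmap, box, num_bg=3):
    """Hypothetical SFMBP-style sampler: one foreground point plus several background
    points, all constrained to lie inside a single box prompt.

    heatmap: (H, W) float array, higher values = more foreground-like
             (assumed stand-in for a Spatial CLIP heatmap).
    box: (x0, y0, x1, y1) integer pixel coordinates, x1/y1 exclusive (assumption).
    Returns (points, labels) following SAM's point-prompt convention:
    points are (x, y) pixel coordinates, labels are 1 = foreground, 0 = background.
    """
    x0, y0, x1, y1 = box
    region = np.asarray(heatmap)[y0:y1, x0:x1]
    if region.size == 0:
        raise ValueError("box does not cover any pixels")
    # Never request more background points than pixels left after the foreground point.
    num_bg = min(num_bg, region.size - 1)
    order = np.argsort(region.ravel(), kind="stable")  # ascending activation
    fg_idx = order[-1]       # single most-activated pixel -> foreground point
    bg_idx = order[:num_bg]  # least-activated pixels -> background points
    idx = np.concatenate(([fg_idx], bg_idx))
    rows, cols = np.unravel_index(idx, region.shape)
    # Shift from box-local to image coordinates and stack as (x, y).
    points = np.stack([cols + x0, rows + y0], axis=1)
    labels = np.array([1] + [0] * num_bg)
    return points, labels
```

With the `segment-anything` package, such points would be passed to `SamPredictor.predict` as `point_coords` and `point_labels`, together with the box as `box`. A practical variant would likely enforce a minimum spacing between background points so they do not cluster in one corner of the box. This sketch omits that for brevity.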
Abstract
Training-free Camouflaged Object Segmentation (COS) seeks to segment camouflaged objects without task-specific training by automatically generating visual prompts to guide the Segment Anything Model (SAM). However, existing pipelines mostly yield semantic-level prompts, which drive SAM toward coarse semantic masks and fail to handle multiple discrete camouflaged instances effectively. To address this limitation, we propose an Instance-Aware Prompting Framework (IAPF), the first framework for training-free COS that upgrades prompt granularity from the semantic level to the instance level while keeping all components frozen. The centerpiece is an Instance Mask Generator with two parts. First, it leverages a detector-agnostic enumerator to produce precise instance-level box prompts for the foreground tag. Second, it introduces the Single-Foreground Multi-Background Prompting (SFMBP) strategy to sample region-constrained point prompts within each box prompt, enabling SAM to output instance masks. Two further components support the pipeline: a simple text prompt generator that produces image-specific tags, and a self-consistency vote across synonymous task-generic prompts that stabilizes inference. Extensive evaluations on three COS benchmarks, two Camouflaged Instance Segmentation (CIS) benchmarks, and two downstream datasets demonstrate state-of-the-art performance among training-free methods. Code will be released upon acceptance.
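One plausible reading of the self-consistency vote is to run the pipeline once per synonymous task-generic prompt (for example, hypothetical prompts such as "camouflaged animal" and "hidden object"). The vote then keeps the candidate mask that agrees most with the others, measured by mean IoU. The functions `mask_iou` and `self_consistency_vote` and the mean-IoU selection rule are assumptions for illustration, not the paper's confirmed procedure. A pixel-wise majority vote would be an equally simple alternative.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks; two empty masks count as identical."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0
    return float(np.logical_and(a, b).sum()) / float(union)

def self_consistency_vote(masks):
    """Hypothetical vote: pick the candidate with the highest mean IoU against all others.

    masks: list of (H, W) boolean arrays, one per synonymous prompt.
    Returns (index, mask) of the most consistent candidate; ties go to the earliest index.
    """
    n = len(masks)
    if n == 0:
        raise ValueError("need at least one candidate mask")
    if n == 1:
        return 0, masks[0]
    # Mean agreement of each candidate with every other candidate.
    scores = [
        sum(mask_iou(masks[i], masks[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    best = int(np.argmax(scores))
    return best, masks[best]
```

Selecting an existing candidate, rather than averaging pixels, keeps the output a mask that SAM actually produced. One outlier prompt that segments the wrong region then simply loses the vote.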
