Discover, Segment, and Select: A Progressive Mechanism for Zero-shot Camouflaged Object Segmentation

Yilong Yang, Jianxin Tian, Shengchuan Zhang, Liujuan Cao

TL;DR

The paper introduces Discover, Segment, and Select (DSS), a training-free framework for camouflaged object segmentation (COS). It combines feature-based object discovery, segmentation with the Segment Anything Model (SAM), and mask selection driven by a multimodal large language model (MLLM) to overcome the localization inaccuracies and multi-instance failures common in prior zero-shot approaches. DSS enhances discovery with a Part Composition module and Similarity-based Box Generation to produce high-quality prompts, segments them via SAM, and then employs a Semantic-driven Mask Selection module that uses progressive, pairwise MLLM comparisons to identify the best mask. Experiments on CHAMELEON, CAMO, COD10K, and NC4K show that DSS achieves state-of-the-art zero-shot COS performance, especially in multi-instance scenes, while maintaining reasonable inference efficiency and lower memory usage. The work offers practical value for training-free camouflage segmentation and a modular framework that could benefit related zero-shot visual segmentation tasks.
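The "progressive, pairwise" selection idea can be illustrated with a minimal sketch. The summary does not specify how comparisons are scheduled, so the running-champion loop below is an assumption, and `prefer_second` is a hypothetical stand-in for the MLLM judgment (here replaced by a toy rule that prefers the mask with more foreground pixels).

```python
from typing import Callable, List, Sequence

Mask = List[List[int]]  # toy binary mask: rows of 0/1 values


def select_progressively(
    candidates: Sequence[Mask],
    prefer_second: Callable[[Mask, Mask], bool],
) -> Mask:
    """Keep a running best mask and compare it with each remaining candidate."""
    if not candidates:
        raise ValueError("need at least one candidate mask")
    best = candidates[0]
    for challenger in candidates[1:]:
        # One pairwise comparison per step; the winner moves on.
        if prefer_second(best, challenger):
            best = challenger
    return best


def toy_judge(current: Mask, challenger: Mask) -> bool:
    """Hypothetical judge: NOT an MLLM, just prefers more foreground pixels."""
    return sum(map(sum, challenger)) > sum(map(sum, current))


small = [[1, 0], [0, 0]]
large = [[1, 1], [1, 0]]
medium = [[1, 1], [0, 0]]
winner = select_progressively([small, large, medium], toy_judge)
```

With N candidates this schedule needs N - 1 comparisons, which is one reason a pairwise scheme can stay cheap compared with asking a model to rank all masks at once.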

Abstract

Current zero-shot Camouflaged Object Segmentation methods typically employ a two-stage pipeline (discover-then-segment): using MLLMs to obtain visual prompts, followed by SAM segmentation. However, relying solely on MLLMs for camouflaged object discovery often leads to inaccurate localization, false positives, and missed detections. To address these issues, we propose the Discover-Segment-Select (DSS) mechanism, a progressive framework designed to refine segmentation step by step. The proposed method contains a Feature-coherent Object Discovery (FOD) module that leverages visual features to generate diverse object proposals, a segmentation module that refines these proposals through SAM segmentation, and a Semantic-driven Mask Selection (SMS) module that employs MLLMs to evaluate and select the optimal segmentation mask from multiple candidates. Without requiring any training or supervision, DSS achieves state-of-the-art performance on multiple COS benchmarks, especially in multiple-instance scenes.
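The three-stage control flow described in the abstract can be sketched as a small orchestration function. This is only an outline under stated assumptions: `run_dss` and the three stage callables are hypothetical names, and the stubs in the usage lines stand in for the FOD module, SAM, and the SMS module, none of which are implemented here.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def run_dss(
    image: Any,
    discover: Callable[[Any], Sequence[Box]],
    segment: Callable[[Any, Box], Any],
    select: Callable[[Any, List[Any]], Any],
) -> Optional[Any]:
    """Discover box prompts, segment each one, then select among the masks."""
    boxes = discover(image)                               # stage 1: proposals
    candidates = [segment(image, box) for box in boxes]   # stage 2: one mask per box
    if not candidates:
        return None                                       # nothing was discovered
    return select(image, candidates)                      # stage 3: pick a mask


# Stub stages for illustration only (assumptions, not the paper's modules).
def stub_discover(image: Any) -> List[Box]:
    return [(0, 0, 1, 1), (0, 0, 3, 3)]


def stub_segment(image: Any, box: Box) -> dict:
    area = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)
    return {"box": box, "area": area}


def stub_select(image: Any, masks: List[dict]) -> dict:
    return max(masks, key=lambda m: m["area"])


result = run_dss("image", stub_discover, stub_segment, stub_select)
```

Passing the stages in as callables mirrors the modularity the abstract claims: any one stage can be swapped without touching the other two.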

Paper Structure

This paper contains 16 sections, 11 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Comparison between the proposed DSS framework and prior zero-shot COS methods.
  • Figure 2: The overall pipeline of the Discovery-Segment-Selection Framework. The framework operates in three stages: (a) Feature-guided Object Discovery (FOD): the Part Composition (PC) module refines low-resolution clustering masks, followed by a Similarity-based Box Generation (SBG) module that produces prompt boxes; (b) the Segment Anything Model generates segmentation masks from the input prompts; (c) Semantic-driven Mask Selection (SMS): an MLLM evaluates all candidate masks and selects the final segmentation.
  • Figure 3: Visual tracking of the iterative refinement process. Left to right: the original image, Leiden clustering map, and binary maps from initial clustering $\mathbf{Y}^{(0)}$ to the final refined map $\mathbf{Y}^{(5)}$. Here we select one cluster from the clustering results for demonstration. The number of iterations may vary for different clusters until convergence.
  • Figure 4: Visual comparison of bboxes generated by different strategies. In each 2$\times$2 grid, top left: bboxes from the ground-truth mask; top right: bboxes from the Leiden-clustered mask; bottom left: bboxes from the PC-refined mask; bottom right: bboxes from the self-similarity map.
  • Figure 5: Visual comparison of our method against existing alternatives under challenging conditions. The red and black bounding boxes correspond to prompts generated by our approach and Qwen, respectively. To aid interpretation, the third column visualizes the similarity maps from which our bounding boxes are derived.
  • ...and 3 more figures
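
Several captions above refer to bounding boxes derived from a binary map (ground-truth, clustered, or refined). As a generic illustration of that step, the sketch below reads one box per connected foreground region using a breadth-first flood fill. It is not the paper's SBG module: the function name `boxes_from_mask`, the 4-connectivity choice, and the inclusive box convention are all assumptions made for this example.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def boxes_from_mask(mask: List[List[int]]) -> List[Box]:
    """Return one bounding box per 4-connected foreground component."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Flood-fill this component while tracking its extent.
            queue = deque([(y, x)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cy, cx = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width:
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


toy_mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
toy_boxes = boxes_from_mask(toy_mask)  # two separate regions, two boxes
```

Producing one box per component rather than a single enclosing box is what allows a box-prompted segmenter to handle scenes with several separate objects.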