Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping

Chunming He, Kai Li, Yachao Zhang, Guoxia Xu, Longxiang Tang, Yulun Zhang, Zhenhua Guo, Xiu Li

TL;DR

This work tackles weakly-supervised Concealed Object Segmentation (COS) by leveraging the Segment Anything Model (SAM) to generate pseudo labels from sparse prompts and introducing a Multi-scale Feature Grouping (MFG) module to promote feature coherence across concealed objects. The WS-SAM framework employs multi-augmentation fusion, entropy-based pixel weighting, and image-level selection to produce reliable supervision, while MFG decomposes features into prototypes at multiple granularities and aggregates them with an RK2-inspired scheme. Together, these components address both weak supervision and intrinsic foreground-background similarity, enabling robust single- and multi-object segmentation. Extensive experiments across camouflaged object detection, polyp segmentation, and transparent object detection demonstrate state-of-the-art performance and strong robustness of the approach.

Abstract

Weakly-Supervised Concealed Object Segmentation (WSCOS) aims to segment objects that blend into their surroundings, using sparsely annotated data for model training. It remains challenging because (1) concealed objects are hard to distinguish from the background due to their intrinsic similarity and (2) sparsely annotated training data provide only weak supervision for model learning. In this paper, we propose a new WSCOS method to address these two challenges. To tackle the intrinsic similarity challenge, we design a multi-scale feature grouping module that first groups features at different granularities and then aggregates the grouping results. By grouping similar features together, it encourages segmentation coherence, which helps produce complete segmentation results for both single-object and multi-object images. For the weak supervision challenge, we use the recently proposed vision foundation model, Segment Anything Model (SAM), with the provided sparse annotations as prompts to generate segmentation masks, which are then used to train the model. To alleviate the impact of low-quality segmentation masks, we further propose a series of strategies: multi-augmentation result ensemble, entropy-based pixel-level weighting, and entropy-based image-level selection. These strategies provide more reliable supervision for training the segmentation model. We verify the effectiveness of our method on various WSCOS tasks, and experiments demonstrate that it achieves state-of-the-art performance on these tasks.
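
The three pseudo-label strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of a plain mean for fusion, the weight `1 - entropy`, and the `max_mean_entropy` threshold are all assumptions made for illustration. It assumes each SAM mask has already been mapped back to the original image frame and is given as a grid of foreground probabilities.

```python
import math

def fuse_masks(masks):
    """Average K probability masks (lists of rows) pixel by pixel.
    Assumption: fusion is a plain mean over augmented views."""
    k = len(masks)
    height, width = len(masks[0]), len(masks[0][0])
    return [[sum(m[i][j] for m in masks) / k for j in range(width)]
            for i in range(height)]

def binary_entropy(p, eps=1e-7):
    """Entropy in bits of a Bernoulli(p) variable; lies in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def pixel_weights(fused):
    """Per-pixel loss weight: confident pixels near 1, uncertain near 0.
    Assumption: weight = 1 - entropy."""
    return [[1.0 - binary_entropy(p) for p in row] for row in fused]

def keep_image(fused, max_mean_entropy=0.3):
    """Image-level selection: keep the pseudo label only if its mean
    entropy is low. The criterion and threshold are hypothetical."""
    values = [binary_entropy(p) for row in fused for p in row]
    return sum(values) / len(values) <= max_mean_entropy

# Two augmented views that disagree on the bottom-right pixel.
masks = [
    [[1.0, 0.0], [0.5, 1.0]],
    [[1.0, 0.0], [0.5, 0.0]],
]
fused = fuse_masks(masks)        # [[1.0, 0.0], [0.5, 0.5]]
weights = pixel_weights(fused)   # bottom row gets weight 0.0
print(keep_image(fused))         # half the pixels are maximally uncertain
```

In a training loop, `weights` would scale the per-pixel loss against the fused mask, and images rejected by `keep_image` would fall back to the original sparse annotations or be skipped; which of these the paper uses is not stated in this summary.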

Paper Structure (12 sections, 11 equations, 5 figures, 7 tables)

This paper contains 12 sections, 11 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Framework of WS-SAM with scribble supervision. Note that the corresponding masks of the augmented images are inversely transformed so as to be consistent with the original image.
  • Figure 2: Masks produced by SAM for different augmented images. We inversely transform the masks to keep them consistent with the original image. The fused masks contain more accurate and complete segmentation information.
  • Figure 3: Architecture of the proposed model. $\Phi_P$ denotes feature grouping with $P$ prototypes. We simplify the broadcast process in $\Phi_P$ for space limitation.
  • Figure 4: Visualized results for the three WSCOS tasks.
  • Figure 5: Results of five runs with varied point annotations.
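
The feature grouping $\Phi_P$ mentioned in the Figure 3 caption can be pictured with a rough sketch of prototype-based grouping at several granularities. This is an illustrative simplification, not the paper's module: the softmax assignment over dot-product similarity, the `temperature` parameter, and the function names are assumptions, and the RK2-inspired aggregation described in the TL;DR is replaced here by a plain mean across granularities.

```python
import math

def group_features(features, prototypes, temperature=1.0):
    """Soft-assign each feature vector to P prototypes, then rebuild it
    as the assignment-weighted mix of those prototypes."""
    grouped = []
    for f in features:
        # Dot-product similarity between the feature and each prototype.
        sims = [sum(a * b for a, b in zip(f, p)) / temperature
                for p in prototypes]
        top = max(sims)
        exps = [math.exp(s - top) for s in sims]  # numerically stable softmax
        total = sum(exps)
        assign = [e / total for e in exps]
        grouped.append([
            sum(assign[k] * prototypes[k][d] for k in range(len(prototypes)))
            for d in range(len(f))
        ])
    return grouped

def multi_scale_group(features, prototype_sets):
    """Group with each prototype set (one per granularity) and average.
    Assumption: a plain mean stands in for the RK2-inspired aggregation."""
    per_scale = [group_features(features, ps) for ps in prototype_sets]
    n = len(per_scale)
    return [
        [sum(scale[i][d] for scale in per_scale) / n
         for d in range(len(features[0]))]
        for i in range(len(features))
    ]

features = [[1.0, 0.0], [0.0, 1.0]]
fine = [[10.0, 0.0], [0.0, 10.0]]   # P = 2 prototypes
coarse = [[2.0, 2.0]]               # P = 1 prototype
print(multi_scale_group(features, [fine, coarse]))
```

Features that are similar to the same prototype are pulled toward the same vector, which is the intuition behind grouping encouraging coherent segmentation; in practice the prototypes would be learned rather than fixed as in this toy example.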