Utilizing Grounded SAM for self-supervised frugal camouflaged human detection

Matthias Pijarowski, Alexander Wolpert, Martin Heckmann, Michael Teutsch

TL;DR

This work tackles camouflaged human detection under data scarcity by combining frugal learning and self-supervised learning (SSL). It fine-tunes pre-trained camouflaged object detection (COD) models (HitNet and SINet-V2) on a small labeled subset of CPD1K and uses Grounded SAM (GSAM) to generate pseudo-labels for SSL, coming close to fully supervised performance with only about 6% of the data. An optimized GSAM-based labeling pipeline can even surpass frugal training on ground-truth (GT) labels in some metrics. GSAM struggles, however, on images that contain only background, which exposes a limitation of zero-shot COD. The results suggest a promising path toward scalable COD with limited labeling and motivate further work on additional foundation-model-based pseudo-labels and on robustness to scenes without objects.

Abstract

Visually detecting camouflaged objects is a hard problem for both humans and computer vision algorithms. Strong similarities between object and background appearance make the task significantly more challenging than traditional object detection or segmentation tasks. Current state-of-the-art models use either convolutional neural networks or vision transformers as feature extractors. They are trained in a fully supervised manner and thus need a large amount of labeled training data. In this paper, both self-supervised and frugal learning methods are introduced to the task of Camouflaged Object Detection (COD). The overall goal is to take two COD reference methods, namely SINet-V2 and HitNet, that are pre-trained for camouflaged animal detection and fine-tune them for camouflaged human detection. To this end, we use the public dataset CPD1K, which contains camouflaged humans in a forest environment. We create a strong baseline using supervised frugal transfer learning for the fine-tuning task. Then, we analyze three pseudo-labeling approaches to perform the fine-tuning task in a self-supervised manner. Our experiments show that pure self-supervision achieves performance similar to that of fully supervised frugal learning.

Paper Structure (12 sections, 6 equations, 8 figures, 5 tables)

This paper contains 12 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Example results of prompting GSAM with different phrases. Because camouflaged humans are difficult to spot, 'soldier' is the only phrase that yields adequate segmentation results.
  • Figure 2: Cumulative mean $F_{\beta}^{w}\uparrow$ measure with 95 % confidence intervals across 30 repeated runs for SINet-V2 (left) and HitNet (right) and for $k \in \{1,2,3,5,10,30,50\}$. After about 10 runs the mean value stabilizes. The performance ($F_{\beta}^{w}\uparrow$ measure) increases only slightly beyond $k=30$. HitNet outperforms SINet-V2. The relative gap between the fully fine-tuned and the frugally learned HitNet with $k=30$ is about 10 %.
  • Figure 3: Relative gap between the fully fine-tuned models and the $k$-shot models for HitNet and SINet-V2 based on the $F_{\beta}^{w}$ measure. HitNet narrows this gap clearly better than SINet-V2.
  • Figure 4: Example output of GSAM prompted with the phrase 'soldier'. In the first row, the soldier is located well by Grounding DINO, which leads to SAM generating a mostly correct segmentation map that can serve well as a pseudo-label. In the second row, the soldier is not detected and the incorrect bounding box proposal leads to an arbitrary segmentation map. The human is indicated in red color in the input image.
  • Figure 5: Results of MAT-based image inpainting for different tile sizes: first row 128px, second row 64px, third row 32px. Three methods are tested to measure region similarities: (c) pixel-error based pixel similarity, (d) MAE-based region similarity, and (e) SSIM. The assumed low similarity would be indicated by a dark blue color. None of the tested approaches meets this assumption.
  • ...and 3 more figures
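
The captions of Figures 2 and 3 refer to a cumulative mean with 95 % confidence intervals over repeated runs and to a relative gap between the fully fine-tuned and the $k$-shot models. The following is a minimal sketch of how such quantities could be computed, not the authors' evaluation code. It assumes a normal-approximation interval and assumes that the relative gap is defined as (full - k-shot) / full; neither definition is stated in the text above. The function names and the example scores are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def cumulative_mean_ci(scores, confidence=0.95):
    """Return a list of (mean, half_width) over the first 1..n runs.

    Assumption: normal-approximation interval, z * s / sqrt(n).
    The half-width is NaN for a single run (no variance estimate).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    results = []
    for n in range(1, len(scores) + 1):
        prefix = scores[:n]
        half = z * stdev(prefix) / math.sqrt(n) if n > 1 else float("nan")
        results.append((mean(prefix), half))
    return results


def relative_gap(full_score, k_shot_score):
    """Assumed definition: fraction of the fully fine-tuned score that is lost."""
    return (full_score - k_shot_score) / full_score


if __name__ == "__main__":
    # Synthetic placeholder scores, NOT values from the paper.
    runs = [0.50, 0.70, 0.60, 0.65, 0.55]
    for i, (m, h) in enumerate(cumulative_mean_ci(runs), start=1):
        print(f"after {i} run(s): mean={m:.3f} +/- {h:.3f}")
    print(f"relative gap: {relative_gap(0.80, 0.72):.1%}")
```

With few runs, a Student-t quantile would give wider and more appropriate intervals than the normal approximation used here; the sketch uses the normal quantile only to stay within the standard library.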