Utilizing Grounded SAM for self-supervised frugal camouflaged human detection

Matthias Pijarowski; Alexander Wolpert; Martin Heckmann; Michael Teutsch

TL;DR

This work tackles camouflaged human detection under data scarcity by combining frugal learning and self-supervised learning (SSL). It fine-tunes pre-trained camouflaged object detection (COD) models (HitNet and SINet-V2) on a small labeled subset of CPD1K and uses Grounded SAM (GSAM) to generate pseudo-labels for SSL, reaching performance close to full supervision with only about 6% of the data. An optimized GSAM-based labeling pipeline can even surpass frugal training on ground-truth (GT) labels in some metrics. However, GSAM struggles on empty-background images, which exposes a limitation of zero-shot COD. The results suggest a promising path toward scalable COD with limited labeling and motivate further work on pseudo-labels from other foundation models and on robustness to scenes that contain no object.

Abstract

Visually detecting camouflaged objects is a hard problem for both humans and computer vision algorithms. Strong similarities between object and background appearance make the task significantly more challenging than traditional object detection or segmentation tasks. Current state-of-the-art models use either convolutional neural networks or vision transformers as feature extractors. They are trained in a fully supervised manner and thus need a large amount of labeled training data. In this paper, both self-supervised and frugal learning methods are introduced to the task of Camouflaged Object Detection (COD). The overall goal is to fine-tune two COD reference methods, namely SINet-V2 and HitNet, which are pre-trained for camouflaged animal detection, for the task of camouflaged human detection. To this end, we use the public dataset CPD1K, which contains camouflaged humans in a forest environment. We create a strong baseline using supervised frugal transfer learning for the fine-tuning task. Then, we analyze three pseudo-labeling approaches to perform the fine-tuning task in a self-supervised manner. Our experiments show that pure self-supervision achieves performance similar to that of fully supervised frugal learning.

Paper Structure (12 sections, 6 equations, 8 figures, 5 tables)

INTRODUCTION
RELATED WORK
METHODOLOGY
Frugal learning for COD
Generating pseudo-labels for self-supervised learning
Learning from noisy labels
EXPERIMENTS AND RESULTS
Fully supervised frugal learning
Self-supervised learning from pseudo-labels
Comparison between fully supervised learning and self-supervised frugal learning
Limitations of Grounded SAM for COD
CONCLUSION

Figures (8)

Figure 1: Examples of GSAM results for different prompt phrases. Since camouflaged humans are difficult to spot, the only phrase that yields adequate segmentation results is 'soldier'.
Figure 2: Cumulative mean $F_{\beta}^{w}\uparrow$ measure with 95 % confidence intervals across 30 repeated runs for SINet-V2 (left) and HitNet (right) and for $k \in \{1,2,3,5,10,30,50\}$. After about 10 runs, the mean value stabilizes. The performance ($F_{\beta}^{w}\uparrow$ measure) increases only slightly beyond $k=30$. HitNet outperforms SINet-V2. The relative gap between the fully fine-tuned and the frugally learned HitNet with $k=30$ is about 10 %.
Figure 3: Relative gap between the fully fine-tuned models and the $k$-shot models for HitNet and SINet-V2 based on the $F_{\beta}^{w}$ measure. HitNet narrows this gap considerably more than SINet-V2.
Figure 4: Example output of GSAM prompted with the phrase 'soldier'. In the first row, the soldier is located well by Grounding DINO, which leads to SAM generating a mostly correct segmentation map that can serve well as a pseudo-label. In the second row, the soldier is not detected and the incorrect bounding box proposal leads to an arbitrary segmentation map. The human is indicated in red color in the input image.
Figure 5: Results of MAT-based image inpainting for different tile sizes: first row 128px, second row 64px, third row 32px. Three methods are tested to measure region similarities: (c) pixel-error based pixel similarity, (d) MAE-based region similarity, and (e) SSIM. The assumed low similarity would be indicated by a dark blue color. None of the tested approaches meets this assumption.
...and 3 more figures
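
The quantities named in the captions of Figures 2 and 3 (a cumulative mean with a 95 % confidence interval over repeated runs, and the relative gap between a fully fine-tuned and a $k$-shot model) can be illustrated with a minimal sketch. The captions do not state how the interval or the gap is computed, so two assumptions are made here: the interval uses a normal approximation, and the gap is taken as (full - frugal) / full. The function names and the scores are hypothetical and are not results from the paper.

```python
import math
from statistics import NormalDist, mean, stdev


def cumulative_mean_ci(scores, confidence=0.95):
    """Return (mean, half_width) computed over the first n scores, for n = 1..N.

    Assumption: normal-approximation interval, z * s / sqrt(n).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    curve = []
    for n in range(1, len(scores) + 1):
        prefix = scores[:n]
        # The sample standard deviation is undefined for a single run.
        half = z * stdev(prefix) / math.sqrt(n) if n > 1 else float("nan")
        curve.append((mean(prefix), half))
    return curve


def relative_gap(full_score, frugal_score):
    """Assumed definition: shortfall of the k-shot model relative to full fine-tuning."""
    return (full_score - frugal_score) / full_score


# Hypothetical per-run scores, for illustration only.
runs = [0.60, 0.64, 0.62, 0.66, 0.63]
curve = cumulative_mean_ci(runs)
final_mean, final_half = curve[-1]
gap = relative_gap(0.70, final_mean)
```

Plotting the first element of each tuple in `curve` against the run index, with the second element as an error band, gives a curve of the kind described for Figure 2. With few runs, a Student-t quantile would give a wider and arguably more appropriate interval than the normal quantile assumed here.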
