Utilizing Grounded SAM for self-supervised frugal camouflaged human detection
Matthias Pijarowski, Alexander Wolpert, Martin Heckmann, Michael Teutsch
TL;DR
This work tackles camouflaged human detection under data scarcity by combining frugal learning with self-supervised learning (SSL). It fine-tunes two pre-trained camouflaged object detection (COD) models, HitNet and SINet-V2, on a small labeled subset of CPD1K and uses Grounded SAM (GSAM) to generate pseudo-labels for SSL, reaching close to fully supervised performance with only about 6% of the data. An optimized GSAM-based labeling pipeline can even surpass frugal training on ground-truth (GT) labels in some metrics. GSAM struggles on empty-background images, however, which highlights a limitation of zero-shot COD. The results suggest a promising path toward scalable COD with limited labeling and motivate exploring additional foundation-model-based pseudo-labels and robustness to non-object scenes.
Abstract
Visually detecting camouflaged objects is a hard problem for both humans and computer vision algorithms. Strong similarities between object and background appearance make the task significantly more challenging than traditional object detection or segmentation. Current state-of-the-art models use either convolutional neural networks or vision transformers as feature extractors. They are trained in a fully supervised manner and thus need a large amount of labeled training data. In this paper, we introduce both self-supervised and frugal learning methods to the task of Camouflaged Object Detection (COD). The overall goal is to take two COD reference methods, SINet-V2 and HitNet, which are pre-trained for camouflaged animal detection, and fine-tune them for camouflaged human detection. To this end, we use the public dataset CPD1K, which contains camouflaged humans in a forest environment. We create a strong baseline using supervised frugal transfer learning for the fine-tuning task. Then, we analyze three pseudo-labeling approaches to perform the fine-tuning task in a self-supervised manner. Our experiments show that pure self-supervision achieves performance similar to that of fully supervised frugal learning.
