When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation
Yuli Zhou, Guolei Sun, Yawei Li, Guo-Sen Xie, Luca Benini, Ender Konukoglu
TL;DR
This work systematically evaluates SAM2 for video camouflaged object segmentation (VCOS) on MoCA-Mask and CAD, analyzing zero-shot performance, prompting strategies, and integration with multimodal large language models (MLLMs). It demonstrates that semi-supervised prompting, especially mask-based prompts on middle frames, yields strong segmentation with favorable speed and model size, outperforming several state-of-the-art VCOS methods. Using SAM2 to refine the outputs of existing VCOS methods improves their results, whereas MLLM-based automation can falter when the initial prompts are of poor quality; targeted fine-tuning on MoCA-Mask further boosts accuracy. Overall, the study establishes SAM2 as a promising tool for VCOS with practical implications for real-time camouflaged object tracking, and provides concrete guidance for prompting, refinement, and dataset-adaptive training.
Abstract
This study investigates the application and performance of the Segment Anything Model 2 (SAM2) on the challenging task of video camouflaged object segmentation (VCOS). VCOS involves detecting objects in videos that blend seamlessly into their surroundings because of similar colors and textures, poor lighting conditions, and other factors. Compared with objects in normal scenes, camouflaged objects are much more difficult to detect. SAM2, a video foundation model, has shown potential in various tasks, but its effectiveness in dynamic camouflaged scenarios remains under-explored. This work presents a comprehensive study of SAM2's ability in VCOS. First, we assess SAM2's performance on camouflaged video datasets using different models and prompts (click, box, and mask). Second, we explore the integration of SAM2 with existing multimodal large language models (MLLMs) and VCOS methods. Third, we adapt SAM2 by fine-tuning it on a video camouflaged object dataset. Our comprehensive experiments demonstrate that SAM2 has excellent zero-shot ability to detect camouflaged objects in videos. We also show that this ability can be further improved by adjusting SAM2's parameters specifically for VCOS. The code is available at https://github.com/zhoustan/SAM2-VCOS
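To make the three prompt types (click, box, and mask) concrete, the following is a minimal sketch of how such prompts could be derived from a single annotated frame. It is an illustration under stated assumptions, not the authors' implementation: the helper names `prompts_from_mask` and `middle_frame_index` are hypothetical, the click is placed at the foreground pixel nearest the mask centroid (one of several plausible choices), and the call that would pass these prompts to SAM2 is deliberately omitted.

```python
import numpy as np


def prompts_from_mask(mask):
    """Derive click, box, and mask prompts from a binary mask of shape (H, W).

    Hypothetical helper for illustration; the paper's exact prompt-sampling
    procedure may differ.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("mask is empty; no prompt can be derived")

    # Box prompt as (x_min, y_min, x_max, y_max), the tight bounding box.
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    # Click prompt: the foreground pixel closest to the centroid, so the
    # click always lands on the object even when the shape is non-convex.
    cy, cx = ys.mean(), xs.mean()
    nearest = int(np.argmin((ys - cy) ** 2 + (xs - cx) ** 2))
    click = (int(xs[nearest]), int(ys[nearest]))

    return {"click": click, "box": box, "mask": mask.astype(bool)}


def middle_frame_index(num_frames):
    """Index of the middle frame, the prompt location highlighted in the TL;DR."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return num_frames // 2


# Toy example: a 6x8 frame with a rectangular object.
toy_mask = np.zeros((6, 8), dtype=np.uint8)
toy_mask[2:5, 3:7] = 1
toy_prompts = prompts_from_mask(toy_mask)
```

In a semi-supervised setting, one of these prompts would be supplied on a single frame (for example the one returned by `middle_frame_index`) and the model would propagate the segmentation to the remaining frames.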
