When SAM2 Meets Video Camouflaged Object Segmentation: A Comprehensive Evaluation and Adaptation

Yuli Zhou, Guolei Sun, Yawei Li, Guo-Sen Xie, Luca Benini, Ender Konukoglu

TL;DR

This work systematically evaluates SAM2 for video camouflaged object segmentation (VCOS) on MoCA-Mask and CAD, analyzing zero-shot performance, prompting strategies, and integration with multimodal language models. It demonstrates that semi-supervised prompting, especially mask-based prompts on middle frames, yields strong segmentation with favorable speed and model size, outperforming several state-of-the-art VCOS methods. Refinement with SAM2 improves VCOS results when applied to existing VCOS outputs, though MLLM-based automation can falter due to initial prompt quality; targeted fine-tuning on MoCA-Mask further boosts accuracy. Overall, the study establishes SAM2 as a promising tool for VCOS with practical implications for real-time camouflaged object tracking, and provides concrete guidance for prompting, refinement, and dataset-adaptive training.

Abstract

This study investigates the application and performance of the Segment Anything Model 2 (SAM2) on the challenging task of video camouflaged object segmentation (VCOS). VCOS involves detecting objects in videos that blend seamlessly into their surroundings because of similar colors and textures, poor lighting conditions, and similar factors. Camouflaged objects are much harder to detect than objects in normal scenes. SAM2, a video foundation model, has shown potential in various tasks, but its effectiveness in dynamic camouflaged scenarios remains under-explored. We present a comprehensive study of SAM2's ability in VCOS. First, we assess SAM2's performance on camouflaged video datasets using different models and prompts (click, box, and mask). Second, we explore the integration of SAM2 with existing multimodal large language models (MLLMs) and VCOS methods. Third, we adapt SAM2 specifically by fine-tuning it on a video camouflaged dataset. Our experiments demonstrate that SAM2 has excellent zero-shot ability to detect camouflaged objects in videos. We also show that this ability can be further improved by adjusting SAM2's parameters specifically for VCOS. The code is available at https://github.com/zhoustan/SAM2-VCOS

Paper Structure (29 sections, 2 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Overview of our evaluation framework. The framework explores different prompts to SAM2 for VCOS. (a) Automatic mode selects the mask (generated by the built-in automatic mask generator) that best aligns with the ground truth mask to serve as the mask prompt. (b) Semi-supervised mode explores different prompt types and different prompt timings based on clicks, boxes, and masks. (c) MLLM + SAM2 uses an MLLM to generate bounding box coordinates as the box prompt. (d) VCOS + SAM2 employs a VCOS model to generate a coarse mask as the mask prompt. MLLM: multimodal large language model; SAM2: Segment Anything Model 2; VCOS: video camouflaged object segmentation.
  • Figure 2: Visualization of masks generated by SAM2's automatic mode on MoCA-Mask. From top to bottom: the input frames, the masks generated in automatic mode, and the ground truths. SAM2 can generate multiple masks (shown in different colors) in this mode. Best viewed in color.
  • Figure 3: Failure cases of SAM2 on MoCA-Mask.
  • Figure 4: Qualitative examples of SAM2 on MoCA-Mask.
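
The automatic mode described in Figure 1(a) picks, from the candidate masks, the one that best aligns with the ground truth mask. The caption does not say how alignment is measured; the minimal sketch below assumes intersection-over-union (IoU) as the criterion. The function names `mask_iou` and `select_best_mask` are hypothetical, and the sketch does not call the SAM2 API: the candidate masks are plain NumPy arrays standing in for the output of the automatic mask generator.

```python
import numpy as np


def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks are empty; treat the overlap as zero.
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)


def select_best_mask(candidates, gt):
    """Return (index, iou) of the candidate mask with the highest IoU.

    Assumption: IoU stands in for "best aligns with the ground truth".
    """
    if len(candidates) == 0:
        raise ValueError("no candidate masks given")
    ious = [mask_iou(mask, gt) for mask in candidates]
    best = int(np.argmax(ious))
    return best, ious[best]


# Toy example: a 2x2 ground-truth square inside a 4x4 frame.
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, 1:3] = True

# Candidate a overlaps the ground truth in a single pixel.
a = np.zeros((4, 4), dtype=bool)
a[0:2, 0:2] = True

# Candidate b covers the ground truth plus two extra pixels.
b = np.zeros((4, 4), dtype=bool)
b[1:3, 1:4] = True

best_idx, best_iou = select_best_mask([a, b], gt)
```

In this toy case the second candidate is selected, since it contains the whole ground-truth region. In the paper's setting the selected mask would then serve as the mask prompt for SAM2.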