When SAM2 Meets Video Shadow and Mirror Detection
Leiping Jie
TL;DR
This work evaluates SAM2 on two video segmentation tasks, Video Shadow Detection (VSD) and Video Mirror Detection (VMD), by initializing the first frame with either a point or a mask prompt and propagating masks through subsequent frames. Results show that mask prompts yield strong, state-of-the-art performance on the ViSha and VMD datasets, while point prompts fail to produce reliable first-frame masks, leading to poor segmentation in later frames. The findings highlight SAM2's potential for temporally consistent video segmentation when the prompt reliably encodes the initial object, and reveal a critical limitation of point-based initialization. The paper suggests future work on automatically generated prompts to improve robustness and practical applicability in real-world video analysis.
Abstract
As the successor to the Segment Anything Model (SAM), the Segment Anything Model 2 (SAM2) not only improves performance in image segmentation but also extends its capabilities to video segmentation. However, its effectiveness in segmenting rare objects that seldom appear in videos remains underexplored. In this study, we evaluate SAM2 on two distinct video segmentation tasks: Video Shadow Detection (VSD) and Video Mirror Detection (VMD). Specifically, we use ground-truth point or mask prompts to initialize the first frame and then predict the corresponding masks for subsequent frames. Experimental results show, both quantitatively and qualitatively, that SAM2's performance on these tasks is suboptimal, especially when point prompts are used. Code is available at https://github.com/LeipingJie/SAM2Video.
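The evaluation protocol described above (prompt the first frame from its ground truth, propagate, then score the later frames) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`point_prompt_from_mask`, `evaluate_video`, `copy_first_mask`) are hypothetical, the rule for picking a point from the ground-truth mask is an assumption, IoU stands in for whatever metrics the paper reports, and `propagate` is a placeholder where a video segmenter such as SAM2 would be called.

```python
import numpy as np


def point_prompt_from_mask(mask):
    """Pick one (x, y) foreground point: the mask pixel nearest the centroid.

    Assumption: the paper's actual point-sampling rule is not given here.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        raise ValueError("first-frame mask is empty")
    cy, cx = ys.mean(), xs.mean()
    nearest = np.argmin((ys - cy) ** 2 + (xs - cx) ** 2)
    return int(xs[nearest]), int(ys[nearest])


def iou(pred, gt):
    """Intersection over union of two binary masks."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)


def evaluate_video(gt_masks, propagate, prompt_type="mask"):
    """Prompt on frame 0, propagate, and average IoU over frames 1..N-1.

    `propagate(prompt, num_frames)` is a hypothetical callable returning one
    predicted mask per frame; a real run would wrap a video segmenter here.
    """
    first = np.asarray(gt_masks[0], dtype=bool)
    if prompt_type == "mask":
        prompt = first
    elif prompt_type == "point":
        prompt = point_prompt_from_mask(first)
    else:
        raise ValueError(f"unknown prompt type: {prompt_type}")
    preds = propagate(prompt, len(gt_masks))
    scores = [iou(p, g) for p, g in zip(preds[1:], gt_masks[1:])]
    return float(np.mean(scores))


def copy_first_mask(prompt, num_frames):
    """Toy stand-in for a segmenter: repeats a mask prompt on every frame."""
    return [np.asarray(prompt, dtype=bool)] * num_frames
```

Passing a function that wraps an actual model in place of `copy_first_mask` gives the same loop for both prompt types, so the point-versus-mask comparison differs only in how the first frame is initialized.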
