When SAM2 Meets Video Shadow and Mirror Detection

Leiping Jie

TL;DR

This work evaluates SAM2 on two video segmentation tasks, Video Shadow Detection and Video Mirror Detection, by initializing the first frame with either point or mask prompts and propagating masks through subsequent frames. Results show that mask prompts yield strong, state-of-the-art performance on the ViSha and VMD datasets, while point prompts fail to produce reliable first-frame masks, leading to poor temporal segmentation. The findings highlight SAM2's potential for temporally consistent video segmentation when the prompt reliably encodes the initial object, and reveal a critical limitation of point-based initialization. The paper suggests future work on auto-generated prompts to improve robustness and practical applicability in real-world video analysis tasks.

Abstract

As the successor to the Segment Anything Model (SAM), the Segment Anything Model 2 (SAM2) not only improves performance in image segmentation but also extends its capabilities to video segmentation. However, its effectiveness in segmenting rare objects that seldom appear in videos remains underexplored. In this study, we evaluate SAM2 on two distinct video segmentation tasks: Video Shadow Detection (VSD) and Video Mirror Detection (VMD). Specifically, we use ground truth point or mask prompts to initialize the first frame and then predict the corresponding masks for subsequent frames. Experimental results show that SAM2's performance on these tasks is suboptimal, both quantitatively and qualitatively, especially when point prompts are used. Code is available at https://github.com/LeipingJie/SAM2Video.
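
The first-frame initialization described above can be illustrated with a small sketch of how point prompts might be drawn from a ground truth mask. This is not the authors' code: the function name `sample_point_prompts`, its parameters, and the uniform random sampling strategy are assumptions made for illustration, and the abstract does not state how the points were actually chosen. The sketch only shows the general idea that positive points lie inside the ground truth region and negative points lie outside it.

```python
import random


def sample_point_prompts(mask, num_pos=1, num_neg=0, seed=0):
    """Sample point prompts from a binary ground truth mask (hypothetical helper).

    mask: list of rows, each a list of 0/1 values (1 = target region).
    Returns (points, labels), where points are (x, y) pixel coordinates and
    labels are 1 for positive points and 0 for negative points.
    """
    rng = random.Random(seed)
    # Collect foreground and background pixel coordinates as (x, y).
    fg = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    bg = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if not v]
    if len(fg) < num_pos or len(bg) < num_neg:
        raise ValueError("mask does not contain enough pixels to sample from")
    # Assumption: points are drawn uniformly at random without replacement.
    pos = rng.sample(fg, num_pos)
    neg = rng.sample(bg, num_neg)
    return pos + neg, [1] * num_pos + [0] * num_neg


if __name__ == "__main__":
    # Toy 4x4 first-frame mask with a 2x2 target region in the centre.
    first_frame_mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0],
    ]
    points, labels = sample_point_prompts(first_frame_mask, num_pos=2, num_neg=1)
    print(points, labels)
```

In a full pipeline, the sampled points (or the ground truth mask itself, for mask prompts) would be passed to SAM2's video predictor together with the first frame, and the predicted masks would then be propagated to the remaining frames.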

Paper Structure (10 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: Qualitative comparison of the predicted segmentations using mask prompts on the ViSha dataset. In each pair of rows, the first images show the RGB image and the ground truth of the 1st frame, while the remaining images in the even and odd rows are the ground truth and predicted shadow masks for the 11th, 21st, 31st, 41st, 51st, 61st, 71st, 81st and 91st frames, respectively. Best viewed on screen.
  • Figure 2: Qualitative comparison of the predicted segmentations using mask prompts on the VMD dataset. In each pair of rows, the first images show the RGB image and the ground truth of the 1st frame, while the remaining images in the even and odd rows are the ground truth and predicted mirror masks for the 11th, 21st, 31st, 41st and 51st frames, respectively. Best viewed on screen.
  • Figure 3: Visualization of masks (orange areas) generated by SAM2 with varying numbers of point prompts. The first two columns show the input images and the ground truths. The remaining columns show the initial segmentation results (the 1st frame of a video) for different numbers of point prompts. The number of point prompts is indicated at the bottom of each column. Green and red five-pointed stars represent positive and negative point prompts, respectively. Best viewed on screen.
  • Figure 4: Qualitative comparison of the predicted segmentations using point prompts on the ViSha dataset. In each pair of rows, the first images show the RGB image and the ground truth of the 1st frame, while the remaining images in the even and odd rows are the ground truth and predicted shadow masks for the 11th, 21st, 31st, 41st, 51st, 61st, 71st, 81st and 91st frames, respectively. Best viewed on screen.
  • Figure 5: Qualitative comparison of the predicted segmentations using point prompts on the VMD dataset. In each pair of rows, the first images show the RGB image and the ground truth of the 1st frame, while the remaining images in the even and odd rows are the ground truth and predicted mirror masks for the 11th, 21st, 31st, 41st and 51st frames, respectively. Best viewed on screen.