Evaluating SAM2 for Video Semantic Segmentation
Syed Hesham Syed Ariff, Yun Liu, Guolei Sun, Jing Yang, Henghui Ding, Xue Geng, Xudong Jiang
TL;DR
This work investigates applying SAM2 to dense Video Semantic Segmentation (VSS) by evaluating two pipelines: using SAM2 as a post-processing refiner to improve existing VSS masks, and using SAM2 as an independent semantic segmentor via feature-based classification of masklets. The refiner consistently improves boundary accuracy (mBIoU) and temporal consistency across multiple backbones and datasets, though it operates offline and incurs a modest frame-rate cost. In contrast, the standalone SAM2-based segmentor underperforms dedicated VSS models, highlighting a semantic discrimination gap in SAM2’s class-agnostic features. The study concludes that SAM2 holds practical value as a high-quality refinement module for VSS, while future work should incorporate semantic awareness, adapters, and open-vocabulary prompting to bridge the gap for dense semantic tasks.
Abstract
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos, capable of storing object-aware memories and propagating them over time through memory blocks. While SAM2 excels at video object segmentation by producing dense segmentation masks from prompts, extending it to dense Video Semantic Segmentation (VSS) is challenging because the task demands spatial accuracy, temporal consistency, and the ability to track multiple objects with complex boundaries and varying scales. This paper explores the extension of SAM2 to VSS, focusing on two primary approaches and highlighting firsthand observations and common challenges encountered along the way. The first approach uses SAM2 to extract unique objects as masks from a given image, while a segmentation network runs in parallel to generate initial predictions, which are then refined using those masks. The second approach uses the predicted masks to extract unique feature vectors, which are fed into a simple network for classification; the resulting class labels and masks are then combined to produce the final segmentation. Our experiments suggest that leveraging SAM2 enhances overall VSS performance, primarily due to its precise predictions of object boundaries.
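To make the two approaches more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the class-agnostic masks have already been obtained from SAM2 as boolean arrays (the SAM2 call itself is omitted). The majority-vote fusion rule, the masked average pooling, and the function names refine_with_masks and masked_average_pool are illustrative assumptions; the abstract does not specify how predictions and masks are combined or how feature vectors are extracted.

```python
import numpy as np

def refine_with_masks(sem_pred, masks, num_classes):
    """Assumed fusion rule: each mask region takes the most frequent
    class that the semantic network predicted inside that region."""
    refined = sem_pred.copy()
    # Visit larger masks first so that smaller masks overwrite them.
    for mask in sorted(masks, key=lambda m: m.sum(), reverse=True):
        if not mask.any():
            continue  # skip empty masks
        votes = np.bincount(sem_pred[mask], minlength=num_classes)
        refined[mask] = votes.argmax()
    return refined

def masked_average_pool(features, mask):
    """Average a (C, H, W) feature map over the pixels of one (H, W)
    boolean mask, giving a single C-dimensional vector for that mask."""
    return features[:, mask].mean(axis=1)

# Toy 4x4 frame: a noisy two-class prediction and two hypothetical masks.
sem_pred = np.array([[0, 0, 1, 1],
                     [0, 1, 1, 1],
                     [0, 0, 1, 1],
                     [0, 0, 0, 1]])
left = np.zeros((4, 4), dtype=bool)
left[:, :2] = True          # left half of the frame
right = ~left               # right half of the frame

# Approach 1: snap the noisy prediction to the mask boundaries.
refined = refine_with_masks(sem_pred, [left, right], num_classes=2)

# Approach 2: one feature vector per mask, to be passed to a classifier.
features = np.stack([sem_pred.astype(float), np.ones((4, 4))])  # (2, 4, 4)
left_vec = masked_average_pool(features, left)
```

In this toy case the two stray pixels near the boundary are overwritten by the dominant class of their mask, which is the intuition behind the boundary improvement the abstract reports. The pooled vector in the second half would be the input to the simple classification network; that classifier is left out here because its architecture is not described in this section.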
