Segment Anything Meets Point Tracking
Frano Rajič, Lei Ke, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu
TL;DR
This work introduces SAM-PT, a point-centric interactive video segmentation framework that couples sparse point tracking with the Segment Anything Model (SAM) to achieve zero-shot video segmentation. SAM-PT propagates user-defined query points through the video with a long-term point tracker, then prompts SAM in each frame with the points that are not occluded, producing per-frame masks without any video segmentation training data. The approach yields strong zero-shot performance on DAVIS, YouTube-VOS, MOSE, and BDD100K, and competitive results for open-world video instance segmentation on UVO, while enabling interactive annotation workflows that markedly reduce labeling effort. The results underscore the value of combining long-term point tracking with foundation-model prompting for robust video segmentation across diverse scenarios. Noted limitations include occluded and very small objects, which leave room for future refinement.
Abstract
The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, enabled by efficient point-centric annotation and prompt-based models. While click and brush interactions are both well explored in interactive image segmentation, existing methods for video focus on mask annotation and propagation. This paper presents SAM-PT, a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation. In contrast to traditional object-centric mask propagation strategies, SAM-PT propagates points, which exploits local structure information that is agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and more efficient interactions. We release our code, which integrates different point trackers and video segmentation benchmarks, at https://github.com/SysCV/sam-pt.
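
The propagate-then-prompt loop summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the function names, the `(T, N, 2)` track layout, and `toy_segment_fn` are all hypothetical. In a real setup, the tracks and visibility flags would come from a long-term point tracker, and SAM would take the place of the stand-in segmenter.

```python
import numpy as np


def select_visible_prompts(tracks, visible):
    """Keep only non-occluded points in each frame.

    tracks:  (T, N, 2) array of (x, y) coordinates for N points over T frames.
    visible: (T, N) boolean array, True where a point is not occluded.
    Returns a list of T arrays, each of shape (K_t, 2).
    """
    return [tracks[t][visible[t]] for t in range(tracks.shape[0])]


def segment_video(frames, tracks, visible, segment_fn):
    """Prompt a point-based segmenter once per frame.

    segment_fn is a hypothetical callable (frame, points) -> boolean mask;
    it stands in for a promptable segmenter such as SAM.
    """
    masks = []
    for frame, points in zip(frames, select_visible_prompts(tracks, visible)):
        if len(points) == 0:
            # Assumption: with no visible point, emit an empty mask.
            masks.append(np.zeros(frame.shape[:2], dtype=bool))
        else:
            masks.append(segment_fn(frame, points))
    return masks


def toy_segment_fn(frame, points, radius=1):
    """Stand-in segmenter: marks a small square around each prompt point."""
    height, width = frame.shape[:2]
    mask = np.zeros((height, width), dtype=bool)
    for x, y in np.round(points).astype(int):
        mask[max(y - radius, 0):y + radius + 1,
             max(x - radius, 0):x + radius + 1] = True
    return mask
```

The visibility filter is the step that matters here: an occluded point would otherwise prompt the segmenter at a location that no longer lies on the target object.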
