Segment Anything Meets Point Tracking

Frano Rajič; Lei Ke; Yu-Wing Tai; Chi-Keung Tang; Martin Danelljan; Fisher Yu

TL;DR

This work introduces SAM-PT, a point-centric interactive video segmentation framework that couples sparse point tracking with the Segment Anything Model (SAM) to achieve zero-shot video segmentation. By propagating user-defined query points through long-term trackers and prompting SAM with non-occluded points, SAM-PT produces per-frame masks without requiring video segmentation training data. The approach yields strong zero-shot performance across DAVIS, YouTube-VOS, MOSE, and BDD100K, and competitive results in open-world VIS on UVO, while enabling interactive annotation workflows that markedly reduce labeling effort. The results underscore the value of integrating long-term point tracking with foundation-model prompting for robust video segmentation across diverse scenarios, with noted limitations in occluded or tiny objects and opportunities for future refinement.
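
The step of prompting SAM only with non-occluded points can be illustrated with a minimal sketch. It assumes the tracker returns per-frame point positions and occlusion scores as nested lists. The function name `visible_point_prompts`, the 0.5 threshold, and the 1/0 label convention are illustrative assumptions, not details taken from the paper or its code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def visible_point_prompts(
    trajectories: Sequence[Sequence[Point]],
    occlusion: Sequence[Sequence[float]],
    labels: Sequence[int],
    threshold: float = 0.5,  # assumed cut-off, not a value from the paper
) -> List[Tuple[List[Point], List[int]]]:
    """Return, for each frame, the (points, labels) of non-occluded points.

    trajectories[t][i] is the (x, y) position of point i in frame t.
    occlusion[t][i] is a score where higher means more likely occluded.
    labels[i] is 1 for a positive point and 0 for a negative point.
    """
    prompts = []
    for frame_points, frame_occlusion in zip(trajectories, occlusion):
        points, point_labels = [], []
        for point, score, label in zip(frame_points, frame_occlusion, labels):
            if score < threshold:  # keep only points predicted visible
                points.append(point)
                point_labels.append(label)
        prompts.append((points, point_labels))
    return prompts
```

Each returned pair could then be passed to a point-prompted segmenter for the corresponding frame, which keeps the per-frame mask predictions independent of one another.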

Abstract

The Segment Anything Model (SAM) has established itself as a powerful zero-shot image segmentation model, enabled by efficient point-centric annotation and prompt-based models. While click and brush interactions are both well explored in interactive image segmentation, the existing methods on videos focus on mask annotation and propagation. This paper presents SAM-PT, a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking. SAM-PT leverages robust and sparse point selection and propagation techniques for mask generation. Compared to traditional object-centric mask propagation strategies, we uniquely use point propagation to exploit local structure information agnostic to object semantics. We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark. Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions. We release our code that integrates different point trackers and video segmentation benchmarks at https://github.com/SysCV/sam-pt.
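
As a rough illustration of sparse point selection, the sketch below draws query points uniformly at random from a binary first-frame mask. This is only the simplest of the sampling strategies the paper names. In particular, the paper samples negative points with its Mixed strategy, which is replaced here by plain random sampling. The function name `sample_query_points` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def sample_query_points(
    mask: Sequence[Sequence[bool]],
    num_positive: int,
    num_negative: int,
    seed: int = 0,
) -> Tuple[List[Tuple[int, int]], List[Tuple[int, int]]]:
    """Sample (x, y) query points uniformly at random from a binary mask.

    Positive points come from pixels inside the mask, negative points from
    the background. Fewer points are returned if a region is too small.
    """
    rng = random.Random(seed)
    inside, outside = [], []
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            (inside if value else outside).append((x, y))
    # Sampling without replacement keeps the selected points distinct.
    positives = rng.sample(inside, min(num_positive, len(inside)))
    negatives = rng.sample(outside, min(num_negative, len(outside)))
    return positives, negatives
```

Only the mask is needed as input here. A feature-based strategy such as Shi-Tomasi would additionally need the image.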

Paper Structure (26 sections, 9 figures, 14 tables, 4 algorithms)

Introduction
Related Work
Method
Preliminaries: SAM
Ours: SAM-PT
SAM-PT vs. Object-centric Mask Propagation
Experiments
Datasets
Implementation Details
Ablation Study
Video Object Segmentation
Video Instance Segmentation
Interactive Point-Based Video Segmentation
Conclusion
MOSE 2023
...and 11 more sections

Figures (9)

Figure 1: Segment Anything Meets Point Tracking (SAM-PT). SAM-PT is a point-centric method that utilizes sparse point propagation for interactive video segmentation, enabling easier interaction and faster annotation. We extend SAM [kirillov2023segment] with long-term point trackers to effectively operate on videos in a zero-shot manner. SAM-PT takes user clicks as "query points" which either denote the target object (positive points) or designate non-target segments (negative points). The points are tracked throughout the video using point trackers that propagate the query points to all video frames, producing trajectory predictions and occlusion scores. SAM is subsequently prompted with the non-occluded points in the trajectories to output a segmentation mask for each video frame independently. The propagated points can be further edited for accurate segmentation and tracking.
Figure 2: Segment Anything Meets Point Tracking (SAM-PT) overview. The essence of SAM-PT is to extend image segmentation foundation models to effectively operate on videos. SAM-PT has four steps: $\mathbf{1}$) Query Points Selection. It starts with first-frame query points which denote the target object (positive points) or designate non-target segments (negative points). These points are provided by the user or derived from a ground truth mask. $\mathbf{2}$) Point Tracking. Initiated with the query points, our approach leverages point trackers to propagate the points across video frames, predicting point trajectories and occlusion scores. $\mathbf{3}$) Segmentation. The trajectories are then used to prompt the Segment Anything Model (SAM) and output per-frame mask predictions. $\mathbf{4}$) Point Tracking Reinitialization. Optionally, the predicted masks are used to reinitialize the query points and restart the process when reaching a prediction horizon $h$. Reinitialization discards unreliable points and adds points on object segments that become visible in later frames.
Figure 3: Positive Point Sampling. For an image paired with either a ground truth or predicted segmentation mask, positive points are sampled from within the mask area using one of the following point sampling methods: Random, K-Medoids [park2009kmedoids], Shi-Tomasi [shi1994good], or Mixed. Notably, Random Sampling and K-Medoids Sampling only require the segmentation mask as input, not the corresponding input image. For negative points, we always use Mixed Sampling on the target object's background mask.
Figure 4: Interacting with SAM in SAM-PT. In the first pass, SAM is prompted exclusively with positive points to define the object's initial localization. In the second pass, both positive and negative points along with the previous mask prediction are fed to the same mask decoder for further mask refinement. The negative points remove segments from the background and neighboring objects and notably help in cases when the point tracker mistakenly predicts positive points off the target object. The second pass is repeated iteratively to get a refined segmentation mask.
Figure 5: Visualization of SAM-PT on DAVIS 2017 [ponttuset2018davis]. The method shows its capability to segment and track objects using the initial masks from the first frame, with circles denoting positive points and crosses negative points. Red symbols indicate occlusion prediction.
...and 4 more figures
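
The four steps in the Figure 2 caption and the two-pass prompting in the Figure 4 caption can be sketched as a control loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `track`, `segment`, and `sample` are hypothetical callables standing in for a point tracker, SAM, and a point sampler, and the default horizon, occlusion threshold, and refinement count are illustrative values.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float]
Mask = List[List[bool]]
# Hypothetical interfaces (assumptions, not a real library API):
# track(frames, queries) -> (trajectories[t][i], occlusion[t][i])
Tracker = Callable[[Sequence[object], List[Point]],
                   Tuple[List[List[Point]], List[List[float]]]]
# segment(frame, points, labels, previous_mask) -> mask
Segmenter = Callable[[object, List[Point], List[int], Optional[Mask]], Mask]
# sample(mask) -> (query points, labels)
Sampler = Callable[[Mask], Tuple[List[Point], List[int]]]


def two_pass_mask(segment: Segmenter, frame: object, points: List[Point],
                  labels: List[int], refinements: int = 1) -> Mask:
    """First pass with positive points only, then refine with all points."""
    positives = [p for p, label in zip(points, labels) if label == 1]
    mask = segment(frame, positives, [1] * len(positives), None)
    for _ in range(refinements):
        # Later passes add negative points and the previous mask.
        mask = segment(frame, points, labels, mask)
    return mask


def segment_video(frames: Sequence[object], queries: List[Point],
                  labels: List[int], track: Tracker, segment: Segmenter,
                  sample: Sampler, horizon: int = 8,
                  occlusion_threshold: float = 0.5) -> List[Mask]:
    """Track points in windows of `horizon` frames and prompt per frame."""
    if horizon < 2:
        raise ValueError("horizon must be at least 2")
    masks: List[Mask] = []
    start = 0
    while start < len(frames):
        window = frames[start:start + horizon]
        trajectories, occlusion = track(window, queries)
        # The first frame of a later window was segmented by the previous one.
        first = 0 if start == 0 else 1
        per_frame = list(zip(window, trajectories, occlusion))[first:]
        for frame, points, scores in per_frame:
            visible = [(p, label) for p, s, label in zip(points, scores, labels)
                       if s < occlusion_threshold]
            masks.append(two_pass_mask(segment, frame,
                                       [p for p, _ in visible],
                                       [label for _, label in visible]))
        if start + len(window) >= len(frames):
            break
        # Reinitialize query points from the last predicted mask; that frame
        # becomes the query frame of the next window.
        queries, labels = sample(masks[-1])
        start += len(window) - 1
    return masks
```

Overlapping consecutive windows by one frame keeps the reinitialized query points aligned with the frame whose mask they were sampled from. Setting `horizon` to at least the video length disables reinitialization, matching its optional role in the caption.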
