SAM-PD: How Far Can SAM Take Us in Tracking and Segmenting Anything in Videos by Prompt Denoising

Tao Zhou, Wenhan Luo, Qi Ye, Zhiguo Shi, Jiming Chen

TL;DR

The paper tackles zero-shot video object tracking and segmentation for open-set objects by leveraging a promptable segmentation model (SAM) without external tracking modules. It reframes tracking as a prompt denoising task, propagating the previous frame's mask bounding box as the next-frame prompt and applying a multi-prompt strategy plus a point-based refinement to handle displacement, scale changes, and occlusions. The proposed SAM-PD method demonstrates competitive performance on DAVIS2017, YouTube-VOS2018, and UVO, illustrating that foundation-model-based video tracking is feasible with lightweight extensions. This work provides a practical baseline for SAM-based downstream tasks and highlights current limitations in semantic discrimination and memory for robust long-term tracking.

Abstract

Recently, promptable segmentation models, such as the Segment Anything Model (SAM), have demonstrated robust zero-shot generalization capabilities on static images. These promptable models exhibit denoising abilities for imprecise prompt inputs, such as imprecise bounding boxes. In this paper, we explore the potential of applying SAM to track and segment objects in videos, treating the tracking task as a prompt denoising task. Specifically, we iteratively propagate the bounding box of each object's mask in the preceding frame as the prompt for the next frame. Furthermore, to enhance SAM's denoising capability against position and size variations, we propose a multi-prompt strategy in which we provide multiple jittered and scaled box prompts for each object and keep the mask prediction with the highest semantic similarity to the template mask. We also introduce a point-based refinement stage to handle occlusions and reduce cumulative errors. Without involving tracking modules, our approach demonstrates comparable performance in video object/instance segmentation tasks on three datasets: DAVIS2017, YouTube-VOS2018, and UVO, serving as a concise baseline and endowing SAM-based downstream applications with tracking capabilities.
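The box propagation and multi-prompt steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scale factors, the jitter range, the number of jittered copies, and the function names (`mask_to_box`, `make_box_prompts`, `select_best`) are all assumptions made for illustration, and the similarity function is left as a caller-supplied placeholder because the summary does not specify how semantic similarity is computed.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Mask = List[List[int]]                   # binary mask as rows of 0/1
T = TypeVar("T")


def mask_to_box(mask: Mask) -> Optional[Box]:
    """Tight bounding box of the nonzero pixels, or None for an empty mask."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (float(min(xs)), float(min(ys)), float(max(xs)), float(max(ys)))


def make_box_prompts(
    box: Box,
    scales: Sequence[float] = (0.9, 1.0, 1.1),  # assumed values
    max_shift: float = 0.1,                     # assumed: shift up to 10% of box size
    n_jitter: int = 2,                          # assumed: jittered copies per scale
    seed: int = 0,
) -> List[Box]:
    """Build a group of scaled and jittered box prompts around `box`."""
    rng = random.Random(seed)
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    w, h = x1 - x0, y1 - y0
    prompts: List[Box] = []
    for s in scales:
        # One centered box per scale, followed by n_jitter shifted copies.
        shifts = [(0.0, 0.0)] + [
            (rng.uniform(-max_shift, max_shift) * w,
             rng.uniform(-max_shift, max_shift) * h)
            for _ in range(n_jitter)
        ]
        half_w, half_h = s * w / 2, s * h / 2
        for dx, dy in shifts:
            prompts.append((cx + dx - half_w, cy + dy - half_h,
                            cx + dx + half_w, cy + dy + half_h))
    return prompts


def select_best(candidates: Sequence[T], similarity: Callable[[T], float]) -> T:
    """Keep the candidate with the highest similarity to the template.

    `similarity` is a hypothetical scoring function supplied by the caller.
    """
    return max(candidates, key=similarity)
```

In a tracking loop under these assumptions, each frame would call `mask_to_box` on the previous prediction, pass every box from `make_box_prompts` to the segmentation model, and keep the mask returned by `select_best`.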

Paper Structure (16 sections, 2 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Top row: two examples where SAM works well with noisy bounding box prompts. Bottom row: our vanilla idea of utilizing the denoising capability of SAM to track and segment objects throughout a video.
  • Figure 2: Overview of our method. Given the mask prediction from the previous frame $I_{t-1}$, we propagate its bounding box to the next frame $I_{t}$ as the box prompt for the corresponding object. We augment this vanilla idea with two extensions: a multi-prompt strategy and a point-based mask refinement. The former constructs a group of jittered and scaled box prompts for each object, leading to multiple mask predictions. Among these predictions, we retain the one with the maximum semantic similarity to the template mask (elaborated in Fig. 3). The latter stage takes the coarse mask, samples one positive prompt (orange points) inside it, and combines it with negative point prompts (blue points) sampled from other foreground objects. We refine the coarse mask with these point prompts, leveraging SAM's multi-prediction capability. The two extensions help reduce cumulative errors. For further details, please refer to the document.
  • Figure 3: Detailed design of the multi-prompt strategy, where we use white boxes and orange boxes to indicate box prompt inputs and the bounding boxes of mask predictions, respectively.
  • Figure 4: Qualitative results on DAVIS 2017 validation set. For each predicted mask, we also plot the box prompt propagated from the previous frame.
  • Figure 5: Failure cases. Our method is bounded by the stability of SAM to box prompts (the third frame of the first case) and faces challenges in handling large object motion (the fourth frame of the first case) and full occlusion.
  • ...and 3 more figures
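
The point-based refinement described in the Figure 2 caption (one positive point inside the coarse mask, negative points drawn from other foreground objects) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' procedure: uniform random sampling, one negative point per other object, and the names `foreground_pixels` and `sample_point_prompts` are all hypothetical. Labels use 1 for positive and 0 for negative points, the convention commonly used for SAM point prompts.

```python
import random
from typing import List, Sequence, Tuple

Mask = List[List[int]]   # binary mask as rows of 0/1
Point = Tuple[int, int]  # (x, y) pixel coordinates


def foreground_pixels(mask: Mask) -> List[Point]:
    """All (x, y) positions where the mask is nonzero."""
    return [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]


def sample_point_prompts(
    coarse_mask: Mask,
    other_masks: Sequence[Mask],
    n_negative_per_object: int = 1,  # assumed value
    seed: int = 0,
) -> Tuple[List[Point], List[int]]:
    """Sample one positive point and negative points for mask refinement.

    Returns (points, labels) with label 1 for the positive point taken from
    the target's coarse mask and label 0 for points taken from other objects.
    Uniform random sampling is an assumption of this sketch.
    """
    rng = random.Random(seed)
    target_pixels = foreground_pixels(coarse_mask)
    if not target_pixels:
        # Nothing to refine, for example when the object is fully occluded.
        return [], []
    points: List[Point] = [rng.choice(target_pixels)]
    labels: List[int] = [1]
    for other in other_masks:
        pixels = foreground_pixels(other)
        k = min(n_negative_per_object, len(pixels))
        for p in rng.sample(pixels, k):
            points.append(p)
            labels.append(0)
    return points, labels
```

The resulting points and labels would then be passed to the segmentation model together with the coarse mask's context to obtain the refined prediction.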