SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory

Cheng-Yen Yang, Hsiang-Wei Huang, Wenhao Chai, Zhongyu Jiang, Jenq-Neng Hwang

TL;DR

SAMURAI extends SAM 2 for zero-shot visual tracking by integrating a Kalman-filter-based motion model and a motion-aware memory selection mechanism. Together, these refine mask selection and curate a memory bank that favors temporally coherent, high-quality cues, reducing error propagation in long sequences without any retraining. Empirical results show state-of-the-art performance on LaSOT, LaSOT$_{\text{ext}}$, and GOT-10k, with competitive results on TrackingNet, NFS, and OTB100, while maintaining real-time inference. The method generalizes well across diverse scenes, notably handling crowded environments and occlusions better than prior SAM-based trackers, making it suitable for real-world dynamic settings.

Abstract

The Segment Anything Model 2 (SAM 2) has demonstrated strong performance in object segmentation tasks but faces challenges in visual object tracking, particularly when managing crowded scenes with fast-moving or self-occluding objects. Furthermore, the fixed-window memory approach in the original model does not consider the quality of memories selected to condition the image features for the next frame, leading to error propagation in videos. This paper introduces SAMURAI, an enhanced adaptation of SAM 2 specifically designed for visual object tracking. By incorporating temporal motion cues with the proposed motion-aware memory selection mechanism, SAMURAI effectively predicts object motion and refines mask selection, achieving robust, accurate tracking without the need for retraining or fine-tuning. SAMURAI operates in real-time and demonstrates strong zero-shot performance across diverse benchmark datasets, showcasing its ability to generalize without fine-tuning. In evaluations, SAMURAI achieves significant improvements in success rate and precision over existing trackers, with a 7.1% AUC gain on LaSOT$_{\text{ext}}$ and a 3.5% AO gain on GOT-10k. Moreover, it achieves competitive results compared to fully supervised methods on LaSOT, underscoring its robustness in complex tracking scenarios and its potential for real-world applications in dynamic environments.
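
The two ideas named in the abstract, motion-guided mask selection and quality-aware memory selection, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper's implementation: the names `Candidate`, `FrameRecord`, `select_mask`, and `select_memory` are hypothetical, the weight `alpha` and the thresholds are arbitrary values, and a plain constant-velocity extrapolation stands in for the predict step of a Kalman filter.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def predict_constant_velocity(prev: Box, curr: Box) -> Box:
    """Stand-in for a Kalman predict step: extrapolate each coordinate linearly."""
    return tuple(c + (c - p) for p, c in zip(prev, curr))


@dataclass
class Candidate:
    box: Box           # bounding box of one candidate mask
    mask_score: float  # the segmenter's own confidence for that mask


def select_mask(cands: List[Candidate], predicted: Box, alpha: float = 0.5) -> int:
    """Return the index of the candidate with the best blended score.

    The blend is alpha * (IoU with the motion prediction) plus
    (1 - alpha) * (mask confidence); alpha is an illustrative value.
    """
    scores = [alpha * iou(c.box, predicted) + (1 - alpha) * c.mask_score
              for c in cands]
    return max(range(len(cands)), key=scores.__getitem__)


@dataclass
class FrameRecord:
    index: int
    mask_score: float
    motion_score: float


def select_memory(history: List[FrameRecord], max_frames: int = 7,
                  min_mask: float = 0.5, min_motion: float = 0.5) -> List[int]:
    """Walk back from the newest frame, keeping only frames that pass both
    (illustrative) thresholds, instead of taking the last max_frames blindly."""
    kept: List[int] = []
    for rec in reversed(history):
        if rec.mask_score >= min_mask and rec.motion_score >= min_motion:
            kept.append(rec.index)
            if len(kept) == max_frames:
                break
    return kept
```

With `alpha = 0`, `select_mask` reduces to picking the most confident mask, which corresponds to the failure mode described for crowded scenes. Likewise, setting both thresholds to zero makes `select_memory` behave like a fixed window over the most recent frames.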

Paper Structure

This paper contains 33 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Illustration of two common failure cases in visual object tracking using SAM 2: (1) In a crowded scene where the target and background objects look similar, SAM 2 tends to ignore motion cues and select the mask with the highest predicted IoU score. (2) The original memory bank simply stores the previous $n$ frames, which introduces poor-quality features during occlusion.
  • Figure 2: The overview of our SAMURAI visual object tracker.
  • Figure 3: SUC and P$_{\text{norm}}$ plots on LaSOT and LaSOT$_{\text{ext}}$.
  • Figure 4: Visualization of tracking results comparing SAMURAI with existing methods. (Top) Conventional VOT methods often struggle in crowded scenarios where the target object is surrounded by objects with similar appearances. (Bottom) The baseline SAM-based method suffers from fixed-window memory composition, leading to error propagation and reduced overall tracking accuracy due to ID switches.