Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking

Cheng-Yen Yang, Hsiang-Wei Huang, Pyong-Kun Kim, Chien-Kai Kuo, Jui-Wei Chang, Kwang-Ju Kim, Chung-I Huang, Jenq-Neng Hwang

TL;DR

The paper addresses robust Visual Object Tracking (VOT) under occlusion, motion blur, and appearance changes by adapting the Segment Anything Model 2 (SAM2). It uses SAM2's memory-driven video segmentation to derive frame-wise bounding boxes and introduces backward tracking and tracklet interpolation to improve stability. The approach places first (AUC 89.4) on the 2024 ICPR Multi-modal Object Tracking challenge, with RGB data providing the strongest signal among modalities. This work highlights how task-specific enhancements to high-quality segmentation models can substantially boost multi-modal VOT performance in practical scenarios.
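
To make the two post-processing ideas in this summary more concrete (boxes derived from per-frame masks, and interpolation across gaps in a tracklet), here is a minimal sketch. It is not the authors' implementation: the function names `mask_to_box` and `interpolate_tracklet`, the `(x, y, w, h)` box format, and the choice of plain linear interpolation for frames whose mask is empty are all assumptions made for illustration.

```python
import numpy as np


def mask_to_box(mask):
    """Return the tight (x, y, w, h) box around nonzero pixels, or None if the mask is empty."""
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # e.g. the target is fully occluded in this frame
    x0, x1 = xs.min(), xs.max()
    y0, y1 = ys.min(), ys.max()
    return (float(x0), float(y0), float(x1 - x0 + 1), float(y1 - y0 + 1))


def interpolate_tracklet(boxes):
    """Fill None entries by linear interpolation between the nearest valid frames.

    Gaps at the very start or end are filled with the nearest valid box,
    because np.interp holds the edge values outside the known range.
    (Assumed behavior for this sketch, not taken from the paper.)
    """
    valid = [i for i, b in enumerate(boxes) if b is not None]
    if not valid:
        return list(boxes)  # nothing to interpolate from
    known = np.array([boxes[i] for i in valid], dtype=float)  # shape (n_valid, 4)
    frames = np.arange(len(boxes))
    filled = np.stack(
        [np.interp(frames, valid, known[:, k]) for k in range(4)], axis=1
    )
    return [tuple(row) for row in filled.tolist()]
```

In this sketch, a frame with an empty mask yields `None`, and `interpolate_tracklet` replaces each `None` with a box blended from its nearest valid neighbors, so the output tracklet has one box per frame.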

Abstract

We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first-place AUC score of 89.4 on the 2024 ICPR Multi-modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology, the specific enhancements made to SAM2, and a comprehensive analysis of our results in the context of VOT solutions, along with the multi-modality aspect of the dataset.
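
For readers unfamiliar with the AUC metric quoted here: in single-object tracking benchmarks it usually denotes the area under the success plot, i.e. the fraction of frames whose predicted box overlaps the ground truth by more than a threshold, averaged over a sweep of thresholds. The challenge's exact evaluation protocol is not given in this section, so the sketch below assumes that conventional definition, with 21 evenly spaced thresholds in [0, 1], a strict `>` comparison, and `(x, y, w, h)` boxes. The names `iou_xywh` and `success_auc` are illustrative.

```python
import numpy as np


def iou_xywh(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds (assumed protocol)."""
    ious = np.array([iou_xywh(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    # Success rate at threshold t: fraction of frames with IoU strictly above t.
    success = [(ious > t).mean() for t in thresholds]
    return float(np.mean(success))
```

Under these assumptions a tracker that matches the ground truth exactly on every frame scores slightly below 1.0, because no IoU can strictly exceed the final threshold of 1.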

Paper Structure

This paper contains 15 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: An example illustration of adapting SAM2 to the VOT task. This pipeline leverages SAM2's powerful segmentation capabilities by using an initial bounding box prompt on the first frame, then uses its memory bank to propagate and refine object masks through subsequent video frames, enabling efficient and accurate object tracking.
  • Figure 2: Sample of the multi-modal videos from the testing sequence.
  • Figure 3: Visualization of our tracking results on the ICPR multi-modal tracking dataset. We selected three tracking cases with different levels of difficulty, caused by the object's moving speed, occlusion, and distractors in the environment.