Adapting SAM 2 for Visual Object Tracking: 1st Place Solution for MMVPR Challenge Multi-Modal Tracking
Cheng-Yen Yang, Hsiang-Wei Huang, Pyong-Kun Kim, Chien-Kai Kuo, Jui-Wei Chang, Kwang-Ju Kim, Chung-I Huang, Jenq-Neng Hwang
TL;DR
The paper adapts the Segment Anything Model 2 (SAM2) to Visual Object Tracking (VOT), targeting robustness under occlusion, motion blur, and appearance changes. It uses SAM2's memory-driven video segmentation to derive frame-wise bounding boxes, then adds backward tracking and tracklet interpolation to improve stability. The approach ranks first on the 2024 ICPR Multi-modal Object Tracking challenge with an AUC of 89.4, and RGB data provides the strongest signal among the modalities. The result suggests that task-specific enhancements to a high-quality segmentation model can substantially improve multi-modal VOT performance.
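To make two of the steps named above concrete, here is a minimal sketch of deriving a bounding box from a per-frame segmentation mask and of filling missing frames in a tracklet by linear interpolation. It is an illustration under stated assumptions, not the authors' implementation: the function names `mask_to_box` and `interpolate_tracklet`, the `(x_min, y_min, x_max, y_max)` box convention, and the choice of linear interpolation are all hypothetical, and backward tracking is not shown.

```python
from typing import List, Optional, Tuple

# Assumed box convention (not taken from the paper): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tight box around nonzero pixels, or None for an empty mask."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None  # the object was not segmented in this frame
    return (float(min(xs)), float(min(ys)), float(max(xs)), float(max(ys)))


def interpolate_tracklet(boxes: List[Optional[Box]]) -> List[Optional[Box]]:
    """Linearly fill gaps (None) that lie between two frames with known boxes.

    Gaps before the first or after the last known box are left as None.
    """
    filled: List[Optional[Box]] = list(boxes)
    known = [i for i, box in enumerate(boxes) if box is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for t in range(left + 1, right):
            w = (t - left) / span  # 0 at the left anchor, 1 at the right anchor
            filled[t] = tuple(
                (1 - w) * a + w * b for a, b in zip(boxes[left], boxes[right])
            )
    return filled
```

In this sketch, a frame whose mask is empty yields `None`, and `interpolate_tracklet` then fills that frame from its nearest known neighbors on either side. Real masks would typically be NumPy arrays or tensors; plain nested lists are used here only to keep the example dependency-free.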
Abstract
We present an effective approach for adapting the Segment Anything Model 2 (SAM2) to the Visual Object Tracking (VOT) task. Our method leverages the powerful pre-trained capabilities of SAM2 and incorporates several key techniques to enhance its performance in VOT applications. By combining SAM2 with our proposed optimizations, we achieved a first-place AUC score of 89.4 on the 2024 ICPR Multi-modal Object Tracking challenge, demonstrating the effectiveness of our approach. This paper details our methodology and the specific enhancements made to SAM2, and analyzes our results in the context of VOT solutions and the multi-modal nature of the dataset.
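For readers unfamiliar with the metric, the sketch below shows one common way an AUC score is computed in single-object tracking: as the area under the success plot, i.e. the mean fraction of frames whose predicted box overlaps the ground truth by more than each IoU threshold. This is an assumption about the metric, not the challenge's official evaluation code; the threshold grid, the strict `>` comparison, and the names `iou` and `success_auc` are illustrative choices.

```python
from typing import Sequence, Tuple

# Assumed box convention (not taken from the paper): (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def success_auc(
    preds: Sequence[Box], gts: Sequence[Box], num_thresholds: int = 21
) -> float:
    """Mean success rate over evenly spaced IoU thresholds in [0, 1]."""
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    # Success rate at a threshold: fraction of frames with IoU above it.
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)
```

Under this convention the value lies in [0, 1] and is often reported multiplied by 100, which is how a figure such as 89.4 would be read.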
