Efficient Motion Prompt Learning for Robust Visual Tracking
Jie Zhao, Xin Chen, Yongsheng Yuan, Michael Felsberg, Dong Wang, Huchuan Lu
TL;DR
The paper addresses the limited robustness of visual object trackers that rely on appearance cues alone by adding temporal motion cues. It introduces Motion Prompt Tracking (MPT), a plug-and-play module that encodes long-term motion trajectories with three positional encodings, fuses them with visual embeddings via a Transformer-based fusion decoder, and adaptively weights the two contributions. Training is efficient: only the MPT components are trained while the baseline tracker stays frozen. Integrated into three trackers (five models in total) and evaluated on seven benchmarks, MPT improves robustness in challenging scenarios such as distractors and occlusions, with minimal training cost and negligible loss of speed.
Abstract
Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at https://github.com/zj5559/Motion-Prompt-Tracking.
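
To make the idea of encoding a trajectory and adaptively fusing it with a visual feature more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`sinusoidal_encoding`, `encode_trajectory`, `adaptive_fuse`), the use of a single sinusoidal encoding with mean pooling in place of the paper's three positional encodings, and the scalar sigmoid gate standing in for the learned adaptive weight are all assumptions made for illustration. The Transformer-based fusion decoder is omitted entirely.

```python
import math


def sinusoidal_encoding(value, dim=8):
    """Encode a scalar into `dim` interleaved sine/cosine features (dim must be even)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    features = []
    for i in range(dim // 2):
        freq = 1.0 / (10000 ** (2 * i / dim))
        features.append(math.sin(value * freq))
        features.append(math.cos(value * freq))
    return features


def encode_trajectory(boxes, dim=8):
    """Hypothetical motion encoder: encode each (cx, cy, w, h) box, then mean-pool over time."""
    if not boxes:
        raise ValueError("trajectory must contain at least one box")
    per_box = []
    for box in boxes:
        vec = []
        for coord in box:
            vec.extend(sinusoidal_encoding(coord, dim))
        per_box.append(vec)
    n = len(per_box)
    # Average each feature column across all time steps.
    return [sum(col) / n for col in zip(*per_box)]


def adaptive_fuse(visual, motion, gate_logit):
    """Hypothetical adaptive weighting: put weight sigmoid(gate_logit) on the motion feature."""
    if len(visual) != len(motion):
        raise ValueError("visual and motion features must have the same length")
    w = 1.0 / (1.0 + math.exp(-gate_logit))
    return [(1.0 - w) * v + w * m for v, m in zip(visual, motion)]


# Usage: three past boxes in normalized image coordinates (made-up values).
trajectory = [(0.50, 0.50, 0.10, 0.20), (0.52, 0.51, 0.10, 0.20), (0.54, 0.52, 0.11, 0.20)]
motion = encode_trajectory(trajectory, dim=4)  # 4 coordinates x 4 features each
visual = [0.0] * len(motion)                   # placeholder for a frozen tracker's feature
fused = adaptive_fuse(visual, motion, gate_logit=0.0)
```

In this sketch a gate logit of 0 weights the two cues equally, while a strongly negative logit falls back to the visual feature alone. In a trained system the gate would be predicted from the inputs, and only the motion-side parameters would be updated while the baseline tracker stays frozen.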
