Efficient Motion Prompt Learning for Robust Visual Tracking

Jie Zhao, Xin Chen, Yongsheng Yuan, Michael Felsberg, Dong Wang, Huchuan Lu

TL;DR

The paper tackles robustness gaps in visual object tracking by incorporating temporal motion cues alongside appearance cues. It introduces Motion Prompt Tracking (MPT), a plug-and-play module that encodes long-term motion trajectories with three positional encodings, fuses them with visual embeddings via a Transformer-based fusion decoder, and adaptively weights their contributions. Training is efficient: only the MPT components are fine-tuned while baselines are frozen, enabling strong gains across seven benchmarks with minimal training cost. The results show significant improvements in robustness on challenging scenarios such as distractors and occlusions, with broad compatibility across multiple trackers, highlighting practical impact for real-time visual tracking.

Abstract

Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at https://github.com/zj5559/Motion-Prompt-Tracking.
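To make the idea of dynamically weighting the two cues concrete, the following is a minimal sketch only. It assumes a single scalar weight obtained from a sigmoid and a simple convex blend of two equal-length embeddings. The names `fuse` and `weight_logit` are hypothetical, and the paper's actual fusion decoder and adaptive weight mechanism may differ substantially from this simplification.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse(visual, motion, weight_logit):
    """Blend a visual and a motion embedding with one scalar weight.

    Assumption for illustration: the weight is a sigmoid of a single
    logit, and the fused embedding is w * visual + (1 - w) * motion.
    """
    if len(visual) != len(motion):
        raise ValueError("embeddings must have the same length")
    w = sigmoid(weight_logit)
    return [w * v + (1.0 - w) * m for v, m in zip(visual, motion)]


# Toy embeddings (hypothetical values, not taken from the paper).
visual = [1.0, 0.0, 2.0]
motion = [0.0, 1.0, 2.0]

balanced = fuse(visual, motion, 0.0)        # both cues trusted equally
vision_heavy = fuse(visual, motion, 20.0)   # weight close to 1: mostly vision
motion_heavy = fuse(visual, motion, -20.0)  # weight close to 0: mostly motion
```

In this toy setting, a large positive logit makes the fused embedding follow the visual cue, while a large negative logit makes it follow the motion cue, which is the behaviour one would want when appearance becomes unreliable (for example under occlusion or with distractors).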

Paper Structure

This paper contains 23 sections, 6 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: Illustration of different tracking paradigms. Our plug-and-play method enables visual trackers to benefit from motion prompts, making them more akin to the human tracking paradigm.
  • Figure 2: Pipeline of our motion prompt-based tracking method. The historical trajectory is first embedded into the visual embedding space by our motion encoder, and then fused with the visual embedding using the proposed fusion decoder. An adaptive weight mechanism is employed to further dynamically adjust vision and motion cues. Ultimately, the obtained fused embedding is used to make a robust tracking prediction by the tracking head $\rm Head_{Tr}$.
  • Figure 3: Illustration of adaptive weights, and tracking performance based on different cues.
  • Figure 4: Success rate across varying trajectory qualities.
  • Figure 5: IoU distributions of average trajectory and the last frame in success and failure cases.
  • ...and 8 more figures