Drift-Resilient Temporal Priors for Visual Tracking

Yuqing Huang, Liting Lin, Weijun Zhuang, Zhenyu He, Xin Li

Abstract

Temporal information is crucial for visual tracking, but existing multi-frame trackers are vulnerable to model drift caused by naively aggregating noisy historical predictions. In this paper, we introduce DTPTrack, a lightweight and generalizable module designed to be seamlessly integrated into existing trackers to suppress drift. Our framework consists of two core components: (1) a Temporal Reliability Calibrator (TRC) mechanism that learns to assign a per-frame reliability score to historical states, filtering out noise while anchoring on the ground-truth template; and (2) a Temporal Guidance Synthesizer (TGS) module that synthesizes this calibrated history into a compact set of dynamic temporal priors to provide predictive guidance. To demonstrate its versatility, we integrate DTPTrack into three diverse tracking architectures--OSTrack, ODTrack, and LoRAT--and show consistent, significant performance gains across all baselines. Our best-performing model, built upon an extended LoRATv2 backbone, sets a new state of the art on several benchmarks, achieving a 77.5% Success rate on LaSOT and an 80.3% AO on GOT-10k.

Figures (4)

  • Figure 1: Comparison of temporal modeling strategies in visual tracking. (a) Autoregressive trackers propagate historical predictions through a sequence model, making them vulnerable to cumulative errors. (b) Dynamic memory trackers update an internal memory over time, but noisy predictions can contaminate the memory state. (c) Online spatial--temporal trackers process short video clips jointly but treat all historical frames as equally reliable. (d) DTPTrack uses a Temporal Reliability Calibrator (TRC) to reliability-weight historical frames and a Temporal Guidance Synthesizer (TGS) to produce temporal prior tokens that guide the tracking backbone and suppress drift.
  • Figure 2: Architectural Overview of the DTPTrack Module within our Extended LoRATv2 Backbone. Our module operates in two stages before the main Transformer blocks. First, the Temporal Reliability Calibrator (TRC) block summarizes the feature embeddings of the template ($z_0$) and three historical reference frames ($z_1, z_2, z_3$) and computes a reliability weight for each. The confidence for the initial ground-truth frame ($z_0$) is fixed to 1.0. Second, the Temporal Guidance Synthesizer (TGS) block converts these calibrated summaries into a set of dynamic prior tokens. These tokens are then prepended to the original sequence of visual features to guide the host tracker's backbone through its Frame-Wise Causal Attention layers.
  • Figure 3: Visualization of the gating mechanism. The model assigns lower scores (indicated by darker colors or lower magnitude) to frames with artifacts or occlusions.
  • Figure 4: Qualitative comparison with state-of-the-art trackers on challenging scenarios.
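To make the two-stage flow described in Figure 2 concrete, the following is a minimal, hypothetical sketch in pure Python. It is not the paper's implementation: the actual TRC and TGS are learned network components, whereas here the reliability gate is approximated by a cosine-similarity score against the template summary (an assumption for illustration), and the prior-token synthesis is approximated by reliability-weighted pooling. The fixed weight of 1.0 for the ground-truth template $z_0$ follows the caption of Figure 2.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def calibrate_reliability(template, history):
    """TRC-style gating (simplified): score each historical frame summary
    by its similarity to the ground-truth template, squashed to (0, 1)
    by a sigmoid. The template itself is anchored at a fixed 1.0."""
    scores = [1.0]  # z_0: ground-truth template, confidence fixed to 1.0
    for frame in history:  # z_1, z_2, z_3, ...
        s = cosine(template, frame)
        scores.append(1.0 / (1.0 + math.exp(-4.0 * s)))  # sigmoid gate
    return scores


def synthesize_priors(template, history, scores, num_priors=2):
    """TGS-style synthesis (simplified): collapse the calibrated frame
    summaries into a compact set of identical prior tokens via
    reliability-weighted averaging; the real module would produce
    distinct learned tokens to prepend to the visual feature sequence."""
    frames = [template] + history
    total = sum(scores)
    pooled = [
        sum(w * f[i] for w, f in zip(scores, frames)) / total
        for i in range(len(template))
    ]
    return [list(pooled) for _ in range(num_priors)]


if __name__ == "__main__":
    template = [1.0, 0.0]                    # z_0 summary
    history = [[1.0, 0.0], [0.0, 1.0]]       # z_1 agrees, z_2 is off-target
    scores = calibrate_reliability(template, history)
    priors = synthesize_priors(template, history, scores)
    # A frame matching the template receives a higher reliability
    # weight than a dissimilar (potentially drifted) one.
    print(scores[1] > scores[2])
```

The key property this toy version preserves is the one the paper argues for: dissimilar (likely drifted or occluded) historical frames receive lower weights and therefore contribute less to the priors that guide the backbone.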