
Real-World Point Tracking with Verifier-Guided Pseudo-Labeling

Görkay Aydemir, Fatma Güney, Weidi Xie

Abstract

Models for long-term point tracking are typically trained on large synthetic datasets. Their performance degrades on real-world videos because the characteristics of these videos differ from synthetic data and dense ground-truth annotations are absent. Self-training on unlabeled videos has been explored as a practical solution, but the quality of pseudo-labels depends strongly on the reliability of teacher models, which varies across frames and scenes. In this paper, we address the problem of real-world fine-tuning and introduce a verifier, a meta-model that learns to assess the reliability of tracker predictions and guide pseudo-label generation. Given candidate trajectories from multiple pretrained trackers, the verifier evaluates them per frame and selects the most trustworthy predictions, resulting in high-quality pseudo-label trajectories. When applied to fine-tuning, verifier-guided pseudo-labeling substantially improves the quality of supervision and enables data-efficient adaptation to unlabeled videos. Extensive experiments on four real-world benchmarks demonstrate that our approach achieves state-of-the-art results while requiring less data than prior self-training methods. Project page: https://kuis-ai.github.io/track_on_r
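
The per-frame selection step described in the abstract can be illustrated with a minimal sketch. It assumes the verifier has already produced one reliability score per teacher and per frame; the function name `select_pseudo_labels`, the score values, and the hard argmax rule are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def select_pseudo_labels(
    candidates: Sequence[Sequence[Point]],  # candidates[k][t]: teacher k's (x, y) at frame t
    scores: Sequence[Sequence[float]],      # scores[t][k]: reliability of teacher k at frame t
) -> Tuple[List[Point], List[int]]:
    """Build one pseudo-label trajectory by picking, at every frame,
    the candidate with the highest reliability score (hypothetical rule)."""
    trajectory: List[Point] = []
    chosen: List[int] = []
    for t, frame_scores in enumerate(scores):
        # Index of the teacher judged most reliable at frame t.
        best = max(range(len(candidates)), key=lambda k: frame_scores[k])
        chosen.append(best)
        trajectory.append(candidates[best][t])
    return trajectory, chosen


# Toy example: two teachers, three frames. All numbers are made up.
teacher_a = [(10.0, 10.0), (11.0, 10.5), (30.0, 40.0)]
teacher_b = [(10.2, 9.9), (15.0, 14.0), (12.1, 11.0)]
scores = [[0.6, 0.4], [0.9, 0.1], [0.2, 0.8]]

trajectory, chosen = select_pseudo_labels([teacher_a, teacher_b], scores)
print(chosen)      # index of the selected teacher at each frame
print(trajectory)  # mixed trajectory used as the pseudo-label
```

In this toy case the selected trajectory switches teachers between frames, which is the behavior that distinguishes per-frame selection from committing to a single teacher for the whole video.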

Paper Structure (24 sections, 8 equations, 6 figures, 6 tables)

This paper contains 24 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Verifier-guided real-world adaptation. (Left) Given a query point in a real-world video, multiple off-the-shelf trackers produce alternative trajectory hypotheses. The verifier evaluates these per-frame predictions and selects the most reliable ones, forming a refined pseudo-label trajectory. (Right) Unlike naïve self-training, which randomly selects a teacher model for pseudo-label generation, the verifier adaptively combines predictions from multiple teachers, providing cleaner supervision for the student tracker during real-world fine-tuning.
  • Figure 2: Teacher inconsistency and oracle performance. (a) Across four real-world datasets, six off-the-shelf teacher models (shown in the legend) are compared against an oracle that, at each frame, selects the most accurate teacher prediction. Individual teachers (colored circles) cluster below the oracle (diamonds), while the black horizontal line marks the performance of random teacher selection. The large gap between the oracle and both individual models and random selection highlights the substantial headroom available for adaptive, per-frame selection. (b) Example from TAP-Vid Kinetics [Doersch2022NeurIPS], in which the pixel errors of teacher predictions fluctuate over time. The upper plot shows per-frame pixel error curves, with occluded frames shaded in gray. Colored lines correspond to the same trackers as in (a), illustrating that accuracy varies across time. The lower row shows uniformly sampled frames with teacher predictions and the ground-truth point (white star).
  • Figure 3: Verifier overview. Given query points at frame $t_0$ and their candidate predictions (teacher outputs during inference or randomly augmented trajectories during training), we extract local features for both the query and each candidate, producing query features $\mathbf{f}^q$ (replicated across time) and candidate features $\mathbf{f}_t$ (a vector for each candidate, per frame). The query features are then decoded by the Candidate Transformer (right), which consists of restricted cross-attention, where each frame-level query attends only to its corresponding candidates, followed by self-attention and feed-forward layers. The transformer outputs per-frame reliability distributions over candidates, capturing spatial and temporal consistency based on feature similarity.
  • Figure 4: Verifier as inference-time ensemble. Comparison of the verifier ensemble against individual teacher models and the random-teacher baseline on real-world datasets. All teacher results are reproduced using their official checkpoints. The verifier consistently achieves the best performance across datasets, demonstrating its ability to exploit the complementary strengths of different models.
  • Figure 5: Localized Feature Extraction. Given frame-wise features of the query frame $t_0$ and target frame $t$, denoted by $\mathbf{F}_{t_0}$ and $\mathbf{F}_t$, we first bilinearly sample the reference feature $\mathbf{q}_{\text{sample}}$ at the query location. A deformable attention module $\phi_{\text{def}}$ then aggregates localized context around each candidate location ($\mathbf{C}_t$), producing descriptors $\mathbf{h}_t$. We concatenate displacement embeddings $\eta(\cdot)$ with an identity embedding (query vs. candidate) and project via $\phi_{\text{proj}}$ to obtain the final query and candidate features consumed by the candidate transformer.
  • ...and 1 more figure
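
The restricted cross-attention mentioned in the Figure 3 caption, where each frame-level query attends only to the candidates of its own frame, can be sketched as follows. This is a simplified illustration under stated assumptions: a single head, no learned projections, and raw dot-product similarity. The function names and toy features are hypothetical, and the attention weights shown here only stand in for the reliability distributions that the full transformer outputs after its self-attention and feed-forward layers.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def restricted_cross_attention(
    queries: Sequence[Vector],               # queries[t]: query feature at frame t
    candidates: Sequence[Sequence[Vector]],  # candidates[t][k]: k-th candidate feature at frame t
) -> Tuple[List[List[float]], List[List[float]]]:
    """Each frame-level query attends only to candidates from the same frame.
    Returns per-frame attention weights and attended features."""
    weights: List[List[float]] = []
    outputs: List[List[float]] = []
    for query, frame_candidates in zip(queries, candidates):
        scale = math.sqrt(len(query))
        # Scaled dot-product similarity between the query and each candidate.
        logits = [
            sum(q * c for q, c in zip(query, cand)) / scale
            for cand in frame_candidates
        ]
        frame_weights = softmax(logits)
        # Weighted sum of this frame's candidate features.
        attended = [
            sum(w * cand[d] for w, cand in zip(frame_weights, frame_candidates))
            for d in range(len(query))
        ]
        weights.append(frame_weights)
        outputs.append(attended)
    return weights, outputs


# Toy example: one query feature replicated across two frames,
# with two candidates per frame. All values are made up.
f_q = [1.0, 0.0]
queries = [f_q, f_q]
candidates = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0: first candidate resembles the query
    [[0.0, 1.0], [2.0, 0.0]],  # frame 1: second candidate resembles the query
]

weights, outputs = restricted_cross_attention(queries, candidates)
for t, frame_weights in enumerate(weights):
    print(t, frame_weights)
```

Looping over frames is equivalent to full cross-attention with a block-diagonal mask that hides candidates from other frames; the loop form simply makes the restriction explicit.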