Denoise to Track: Harnessing Video Diffusion Priors for Robust Correspondence

Tianyu Yuan, Yuanbo Yang, Lin-Zhuo Chen, Yao Yao, Zhuzhong Qian

TL;DR

This paper presents HeFT, a zero-shot point tracking framework that repurposes the priors of a pretrained Video Diffusion Transformer to establish temporally consistent correspondences without labeled data. Its analysis shows that attention heads specialize in different roles and that low-frequency feature components matter most for matching. Based on these findings, it introduces a head- and frequency-aware feature selection strategy and a single-step denoising pipeline, paired with soft-argmax localization and forward-backward consistency checks for robust tracking. Results on the TAP-Vid benchmarks show state-of-the-art zero-shot performance that approaches supervised methods, suggesting that video diffusion models can serve as visual foundation models for downstream perception tasks.

Abstract

In this work, we introduce HeFT (Head-Frequency Tracker), a zero-shot point tracking framework that leverages the visual priors of pretrained video diffusion models. To better understand how they encode spatiotemporal information, we analyze the internal representations of Video Diffusion Transformer (VDiT). Our analysis reveals that attention heads act as minimal functional units with distinct specializations for matching, semantic understanding, and positional encoding. Additionally, we find that the low-frequency components in VDiT features are crucial for establishing correspondences, whereas the high-frequency components tend to introduce noise. Building on these insights, we propose a head- and frequency-aware feature selection strategy that jointly selects the most informative attention head and low-frequency components to enhance tracking performance. Specifically, our method extracts discriminative features through single-step denoising, applies feature selection, and employs soft-argmax localization with forward-backward consistency checks for correspondence estimation. Extensive experiments on TAP-Vid benchmarks demonstrate that HeFT achieves state-of-the-art zero-shot tracking performance, approaching the accuracy of supervised methods while eliminating the need for annotated training data. Our work further underscores the promise of video diffusion models as powerful foundation models for a wide range of downstream tasks, paving the way toward unified visual foundation models.
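The last two steps named in the abstract, soft-argmax localization and the forward-backward consistency check, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cosine-similarity matching, the `temperature` and `max_error` values, the nearest-neighbour feature sampling, and all function names are assumptions made for illustration, and the feature maps stand in for whatever head- and frequency-selected VDiT features the method actually uses.

```python
import numpy as np


def soft_argmax(query_feat, target_feats, temperature=0.05):
    """Expected (x, y) location of query_feat inside target_feats.

    query_feat: (C,) feature vector; target_feats: (H, W, C) feature map.
    The temperature value is an illustrative assumption.
    """
    H, W, _ = target_feats.shape
    q = query_feat / (np.linalg.norm(query_feat) + 1e-8)
    t = target_feats / (np.linalg.norm(target_feats, axis=-1, keepdims=True) + 1e-8)
    logits = (t @ q) / temperature          # (H, W) scaled cosine similarities
    logits -= logits.max()                  # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum()
    ys, xs = np.mgrid[0:H, 0:W]             # pixel coordinate grids
    return np.array([(weights * xs).sum(), (weights * ys).sum()])


def sample_feature(feats, xy):
    """Nearest-neighbour lookup of the feature at (x, y); an assumed simplification."""
    H, W, _ = feats.shape
    x = int(np.clip(np.rint(xy[0]), 0, W - 1))
    y = int(np.clip(np.rint(xy[1]), 0, H - 1))
    return feats[y, x]


def track_with_fb_check(feats_a, feats_b, xy_a, max_error=1.5, temperature=0.05):
    """Track xy_a from frame A to frame B, then back, and compare.

    Returns the estimated position in B and whether the round trip lands
    within max_error pixels of the start (hypothetical threshold).
    """
    xy_b = soft_argmax(sample_feature(feats_a, xy_a), feats_b, temperature)
    xy_back = soft_argmax(sample_feature(feats_b, xy_b), feats_a, temperature)
    fb_error = float(np.linalg.norm(xy_back - np.asarray(xy_a, dtype=float)))
    return xy_b, fb_error <= max_error
```

Calling `track_with_fb_check(feats_a, feats_b, (3, 4))` on two `(H, W, C)` feature maps returns the estimated position in the second frame together with a flag that is false when the backward track fails to return near the query point, which is the usual signal for an occluded or unreliable correspondence.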

Paper Structure

This paper contains 19 sections, 4 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: HeFT. Our zero-shot point tracking framework exploits the visual priors of pretrained VDiT. A head- and frequency-aware feature selection strategy further enhances tracking robustness and accuracy.
  • Figure 2: Layer vs Head Performance. Performance comparison showing layer-level performance and the best, worst, and average performance of all heads within each layer.
  • Figure 3: Head Specialization. (a) Similarity heatmaps of three different attention heads across two frames. The blue hollow boxes indicate the query patches, while deeper red regions represent higher similarity scores. Different heads exhibit distinct cross-frame correspondence patterns, highlighting the diversity of attention heads. (b) Attention patterns for matching-head, semantic-head, and position-head.
  • Figure 4: Temporal rotation angles across frequency bands. High-frequency components (top 25%) exhibit significant rotations across frames, capturing fine-grained positional information, while low-frequency components (bottom 50%) remain nearly invariant, encoding stable semantic content.
  • Figure 5: Norms of query and key across frequency bands. The matching-oriented head (L18H5) exhibits substantially larger norms in low-frequency components than in high-frequency ones, whereas the position-oriented head (L20H8) shows the opposite trend, with high-frequency components dominating.
  • ...and 5 more figures