Emergent Temporal Correspondences from Video Diffusion Transformers

Jisu Nam, Soowon Son, Dahyun Chung, Jiyoung Kim, Siyoon Jin, Junhwa Hur, Seungryong Kim

TL;DR

DiffTrack addresses how video Diffusion Transformers (DiTs) establish temporal correspondences across frames. It provides a dataset and metrics to quantify the role of 3D cross-frame attention, showing that query-key similarities in a small set of layers drive robust temporal matching, which strengthens during denoising. The framework enables true zero-shot point tracking and introduces Cross-Attention Guidance to improve motion coherence in generated videos without additional training. Additional findings extend to motion-aware generation and cross-backbone validation across CogVideoX, HunyuanVideo, and CogVideoX-I2V. Overall, DiffTrack offers a principled lens into temporal dynamics in DiTs and a practical toolkit for downstream motion tasks.

Abstract

Recent advancements in video diffusion models based on Diffusion Transformers (DiTs) have achieved remarkable success in generating temporally coherent videos. Yet, a fundamental question persists: how do these models internally establish and represent temporal correspondences across frames? We introduce DiffTrack, the first quantitative analysis framework designed to answer this question. DiffTrack constructs a dataset of prompt-generated videos with pseudo ground-truth tracking annotations and proposes novel evaluation metrics to systematically analyze how each component within the full 3D attention mechanism of DiTs (e.g., representations, layers, and timesteps) contributes to establishing temporal correspondences. Our analysis reveals that query-key similarities in specific, but not all, layers play a critical role in temporal matching, and that this matching becomes increasingly prominent during the denoising process. We demonstrate practical applications of DiffTrack in zero-shot point tracking, where it achieves state-of-the-art performance compared to existing vision foundation and self-supervised video models. Further, we extend our findings to motion-enhanced video generation with a novel guidance method that improves temporal consistency of generated videos without additional training. We believe our work offers crucial insights into the inner workings of video DiTs and establishes a foundation for further research and applications leveraging their temporal understanding.

Paper Structure

This paper contains 66 sections, 8 equations, 32 figures, 7 tables.

Figures (32)

  • Figure 1: Teaser: DiffTrack reveals how video Diffusion Transformers (DiTs) establish temporal correspondences during video generation. Given a prompt and starting points, DiffTrack tracks how individual points align across subsequent frames via cross-frame attention in video DiTs (second row). This enables the extraction of coherent motion trajectories (third row) from both generated and real-world videos in a zero-shot manner.
  • Figure 2: Illustration of full 3D attention in video DiTs, where video frame latents and text embeddings interact.
  • Figure 3: Our curated evaluation dataset includes: (a) an object dataset for dynamic object-centric videos and (b) a scene dataset for static scenes with camera motion. Each dataset comprises 50 prompt-generated video pairs per video generative model (e.g., CogVideoX-2B [yang2024cogvideox]). In the benchmark, we predefine starting points in the first frame and obtain pseudo ground-truth trajectories using an off-the-shelf tracking method [karaev2024cotracker].
  • Figure 4: Analysis of temporal matching in CogVideoX-2B [yang2024cogvideox]. (a) Query-key matching outperforms intermediate feature matching, highlighting the effectiveness of cross-frame interactions in 3D attention. (b) The harmonic mean of query-key matching shows that temporal matching is primarily driven by a few specific layers. (c) Temporal matching improves progressively during denoising but degrades slightly near the final steps.
  • Figure 5: Evolution of attention scores across timesteps.
  • ...and 27 more figures
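
The query-key matching mentioned in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation. It assumes query descriptors for a few source-frame points and key descriptors for every token of a target frame have already been extracted from one attention layer at one denoising timestep. The function name `query_key_matching`, the argument names, and the single-head, single-layer setup are all hypothetical choices made for illustration.

```python
import numpy as np

def query_key_matching(q_src, k_tgt, tgt_hw):
    """Match source-frame points to target-frame tokens via attention.

    q_src:  (N, d) query descriptors at N source-frame points (assumed input).
    k_tgt:  (H*W, d) key descriptors of all target-frame tokens (assumed input).
    tgt_hw: (H, W) latent grid size of the target frame.
    Returns an (N, 2) array of (row, col) matches and an (N,) confidence array.
    """
    d = q_src.shape[-1]
    # Scaled dot-product similarity between each query and every target key.
    logits = q_src @ k_tgt.T / np.sqrt(d)
    # Numerically stable softmax over target tokens.
    logits = logits - logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)
    # The most-attended target token is taken as the correspondence.
    idx = attn.argmax(axis=-1)
    _, w = tgt_hw
    rows, cols = np.divmod(idx, w)
    return np.stack([rows, cols], axis=-1), attn.max(axis=-1)

# Toy usage: one-hot keys so that each query has an unambiguous match.
h, w = 4, 6
keys = np.eye(h * w)
queries = keys[[0, 7, 23]]
matches, confidence = query_key_matching(queries, keys, (h, w))
```

Under these assumptions, the maximum attention weight doubles as a rough per-point confidence. Choosing which layers and timesteps to read the descriptors from is the question the paper's analysis addresses, and the sketch leaves it open.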