Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

Youngseo Kim, Dohyun Kim, Geonhee Han, Paul Hongsuck Seo

TL;DR

This paper tackles zero-shot video object tracking by repurposing pretrained image diffusion models. It shows that diffusion self-attention acts as a temporal label-propagation kernel across frames, enabling mask propagation without video-specific supervision. The authors apply three test-time optimizations (DDIM inversion, mask-specific textual inversion, and adaptive head weighting) and add SAM-based mask refinement to form DRIFT, a diffusion-based tracking framework that achieves state-of-the-art zero-shot performance on four standard VOS benchmarks. The reported results show temporally coherent masks that stay faithful to the tracked object, without task-specific training data. Overall, the work argues that diffusion-model representations are a strong, generalizable foundation for video analysis and tracking.

Abstract

Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate how their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos that leverages a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
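
The propagation idea described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: it assumes that per-pixel queries for the target frame and keys for the reference frame have already been extracted from a diffusion model's self-attention layer, and random arrays stand in for them here. The function names `attention_kernel` and `propagate_labels` are hypothetical.

```python
import numpy as np

def attention_kernel(q_tgt, k_ref):
    """Row-stochastic affinity from target-frame queries to reference-frame keys.

    q_tgt: (n_tgt, d) queries of the target frame (assumed given).
    k_ref: (n_ref, d) keys of the reference frame (assumed given).
    """
    d = q_tgt.shape[-1]
    logits = q_tgt @ k_ref.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

def propagate_labels(kernel, ref_labels):
    """Each target pixel receives the attention-weighted average of reference labels."""
    return kernel @ ref_labels

# Toy stand-ins for real diffusion features (illustrative sizes only).
rng = np.random.default_rng(0)
n_ref, n_tgt, d, n_obj = 16, 16, 8, 2
k_ref = rng.normal(size=(n_ref, d))
q_tgt = rng.normal(size=(n_tgt, d))
ref_labels = np.eye(n_obj)[rng.integers(0, n_obj, size=n_ref)]  # one-hot reference mask

kernel = attention_kernel(q_tgt, k_ref)          # (n_tgt, n_ref)
tgt_soft = propagate_labels(kernel, ref_labels)  # soft labels per target pixel
tgt_mask = tgt_soft.argmax(axis=-1)              # hard mask for the target frame
```

Because every row of the kernel sums to one, the propagated soft labels remain a valid distribution over objects at each target pixel, so taking the argmax gives a hard mask directly.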

Paper Structure

This paper contains 35 sections, 4 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Visualization of Label Propagation via Self-Attention from a Text-to-Image Diffusion Model. Given an input image (a), the coarse map (b), which corresponds to the cross-attention response for the token “cat”, provides approximate object localization based on the text prompt, while the self-attention map (c) captures semantic affinities across image regions to refine the coarse localization. Leveraging the self-attention map as a learned label propagation kernel, the coarse map is propagated to yield the final mask (d), which achieves substantially improved spatial precision, closely aligning with the GT mask (e).
  • Figure 2: Comparison of Cosine Similarity vs. Self-attention for Label Propagation. (a) The blue dot in frame $t$ is propagated to frame $t'$. (b) Cosine similarity produces dispersed activations scattered across unrelated regions. (c) The aggregated self-attention map, in contrast, focuses sharply on the corresponding object region. (d) Individual attention heads exhibit complementary but distinct patterns, highlighting the diverse semantic relationships captured by multi-head self-attention.
  • Figure 3: Comparison of Per-frame $\mathcal{J}\&\mathcal{F}_\mathrm{m}$ between Self-attention and Cosine-similarity Affinity Maps on DAVIS 2017. Corresponding mask visualizations highlight that cosine similarity produces noisy and dispersed affinities, whereas self-attention yields spatially precise and temporally consistent masks.
  • Figure 4: Comparison of $\mathcal{J}\&\mathcal{F}_\mathrm{m}$ between Random Noise Injection and DDIM Inversion across Diffusion Timesteps $\tau$ on DAVIS 2017. DDIM inversion (blue) achieves higher peak performance and maintains stable segmentation quality across timesteps, whereas random noise injection (orange) rapidly degrades after its early peak due to semantic washout at large $\tau$.
  • Figure 5: Visualization of t-SNE Embeddings. A comparison of learned embeddings and class-name embeddings shows that the learned embeddings form distinct clusters, separate from the class-name embeddings.
  • ...and 3 more figures
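
The contrast in Figure 4 between random noise injection and DDIM inversion can be sketched with the standard update rules. This is a minimal illustration under stated assumptions, not the paper's pipeline: `toy_eps_model` is a hypothetical placeholder for the pretrained noise-prediction network, and the 50-step linear beta schedule is chosen only for illustration. The point it shows is that DDIM inversion maps a frame to a noisy latent deterministically, while random injection mixes in independent noise on every call.

```python
import numpy as np

def make_alpha_bar(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention schedule; index 0 is the clean image."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def toy_eps_model(x, t):
    # HYPOTHETICAL stand-in for a pretrained diffusion model's noise prediction.
    return np.tanh(x)

def random_noise_injection(x0, tau, alpha_bar, rng):
    """Forward diffusion to step tau: scale the input and add fresh Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[tau]) * x0 + np.sqrt(1.0 - alpha_bar[tau]) * eps

def ddim_inversion(x0, tau, alpha_bar, eps_model):
    """Deterministic DDIM inversion: step from t to t+1 using the predicted noise."""
    x = x0.copy()
    for t in range(tau):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, t)
        x0_pred = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
    return x

alpha_bar = make_alpha_bar()
x0 = np.random.default_rng(0).standard_normal((4, 4))  # toy "latent frame"
tau = 20

# Two inversions of the same input give the same latent.
inv_a = ddim_inversion(x0, tau, alpha_bar, toy_eps_model)
inv_b = ddim_inversion(x0, tau, alpha_bar, toy_eps_model)

# Two random injections of the same input give different latents.
noisy_a = random_noise_injection(x0, tau, alpha_bar, np.random.default_rng(1))
noisy_b = random_noise_injection(x0, tau, alpha_bar, np.random.default_rng(2))
```

In this sketch the inverted latent depends only on the input and the model, whereas the injected latent also depends on the sampled noise, whose share of the signal grows as $\tau$ increases.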