Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos

Youngseo Kim; Dohyun Kim; Geonhee Han; Paul Hongsuck Seo

TL;DR

This paper tackles zero-shot video object tracking by repurposing pretrained image diffusion models. It shows that diffusion self-attention acts as a temporal label-propagation kernel across frames, enabling mask propagation without video-specific supervision. The authors introduce three test-time optimizations (DDIM inversion, mask-specific textual inversion, and adaptive head weighting) and integrate a SAM-based refinement to form DRIFT, a diffusion-based tracking framework that achieves state-of-the-art zero-shot performance on four standard VOS benchmarks. The results demonstrate robust temporal coherence and object fidelity without task-specific training data. Overall, the work argues that diffusion-model representations offer a strong, generalizable foundation for video analysis and tracking.

Abstract

Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate how their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos that leverages a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
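
The label-propagation view described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `propagate_labels`, the `temperature` parameter, and the toy features are assumptions made for illustration, and the features stand in for the query and key projections that a diffusion self-attention layer would supply. The sketch only shows the general idea that a row-normalized attention map between target-frame queries and reference-frame keys can carry mask labels from one frame to another.

```python
import numpy as np


def propagate_labels(query_feats, key_feats, key_labels, temperature=0.1):
    """Carry soft labels from reference pixels to target pixels.

    query_feats: (Nq, D) features of target-frame pixels (hypothetical stand-in
                 for self-attention queries).
    key_feats:   (Nk, D) features of reference-frame pixels (stand-in for keys).
    key_labels:  (Nk, C) one-hot or soft mask labels of the reference pixels.
    Returns the propagated labels (Nq, C) and the kernel (Nq, Nk).
    """
    # Dot-product similarity between every target and reference pixel.
    logits = query_feats @ key_feats.T / temperature
    # Subtract the row maximum for a numerically stable softmax.
    logits -= logits.max(axis=1, keepdims=True)
    kernel = np.exp(logits)
    # Normalize each row so it is a distribution over reference pixels.
    kernel /= kernel.sum(axis=1, keepdims=True)
    # Each target pixel receives a weighted average of reference labels.
    return kernel @ key_labels, kernel


# Toy example: four reference "pixels" with orthonormal features; the first
# two are labelled background (class 0), the last two object (class 1).
ref_feats = np.eye(4)
ref_labels = np.eye(2)[[0, 0, 1, 1]]
# Two target-frame pixels whose features match reference pixels 2 and 0.
tgt_feats = ref_feats[[2, 0]]

propagated, kernel = propagate_labels(tgt_feats, ref_feats, ref_labels)
predicted = propagated.argmax(axis=1)
```

In this toy setting each target pixel inherits the label of the reference pixel it most resembles. In a video setting the reference would presumably be an earlier frame with a known mask, and the propagated mask would then serve as the reference for the next frame.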
