From Image to Video: An Empirical Study of Diffusion Representations
Pedro Vélez, Luisa F. Polanía, Yi Yang, Chuhan Zhang, Rishabh Kabra, Anurag Arnab, Mehdi S. M. Sajjadi
TL;DR
This work asks whether representations learned by video diffusion models outperform those learned by image diffusion models for visual understanding. Using a single architecture, the Window Attention Latent Transformer (WALT), and a common probing framework, it compares models trained for video versus image generation across diverse downstream tasks, including image classification, action recognition, depth estimation, pose estimation, and tracking. Video-pre-trained representations are consistently stronger, especially on motion- and space-time-related tasks. The study also analyzes how feature layer, noise level, model size, and pre-training budget influence both representation and generation quality. These findings illuminate the role of temporal information in diffusion-based representations and establish a baseline for future cross-architecture comparisons and scaling analyses.
Abstract
Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of their latent representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.
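To make the probing setup concrete, the following is a minimal sketch of how features might be read out of a frozen diffusion backbone at a chosen layer and noise level and then evaluated with a lightweight probe. Everything in it is an illustrative assumption rather than the study's implementation: the variance-preserving noising rule, the toy `frozen_features` stand-in for the backbone, the closed-form linear probe (the study's readout heads may differ), and all function and parameter names.

```python
import numpy as np

def add_noise(latents, noise_level, rng):
    """Assumed variance-preserving noising: blend clean latents with Gaussian noise.

    noise_level = 0 returns the clean latents; noise_level = 1 returns pure noise.
    """
    eps = rng.standard_normal(latents.shape)
    return np.sqrt(1.0 - noise_level) * latents + np.sqrt(noise_level) * eps

def frozen_features(noisy_latents, weights, layer):
    """Hypothetical stand-in for a frozen backbone.

    Applies the first `layer` blocks and returns the intermediate activations.
    The weights are never updated during probing.
    """
    h = noisy_latents
    for w in weights[:layer]:
        h = np.tanh(h @ w)
    return h

def _with_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])

def fit_linear_probe(features, labels, num_classes, ridge=1e-3):
    """Fit a ridge-regularized linear probe onto one-hot labels in closed form."""
    x = _with_bias(features)
    y = np.eye(num_classes)[labels]
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ y)

def probe_accuracy(probe, features, labels):
    """Fraction of examples whose highest-scoring class matches the label."""
    scores = _with_bias(features) @ probe
    return float(np.mean(np.argmax(scores, axis=1) == labels))

def sweep(latents, labels, weights, layers, noise_levels, num_classes, seed=0):
    """Probe every (layer, noise level) pair and return a dict of accuracies.

    For brevity the probe is fit and scored on the same data; a real
    evaluation would use held-out examples.
    """
    rng = np.random.default_rng(seed)
    results = {}
    for layer in layers:
        for noise_level in noise_levels:
            noisy = add_noise(latents, noise_level, rng)
            feats = frozen_features(noisy, weights, layer)
            probe = fit_linear_probe(feats, labels, num_classes)
            results[(layer, noise_level)] = probe_accuracy(probe, feats, labels)
    return results
```

Under these assumptions, calling `sweep` once with the weights of an image-trained model and once with those of a video-trained model of the same architecture would yield two accuracy grids over layers and noise levels that can be compared entry by entry.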
