From Image to Video: An Empirical Study of Diffusion Representations

Pedro Vélez, Luisa F. Polanía, Yi Yang, Chuhan Zhang, Rishabh Kabra, Anurag Arnab, Mehdi S. M. Sajjadi

TL;DR

This work addresses whether video diffusion representations outperform image diffusion representations for visual understanding. Using a unified Windowed-Attention Latent Transformer (WALT) and a probing framework, it compares the same architecture trained on video versus image generation across diverse downstream tasks, including classification, action recognition, depth, pose estimation, and tracking. The study shows that video-pre-trained representations are consistently stronger, especially for motion- and space-time–related tasks, and it analyzes how feature layer, noise level, model size, and pre-training budget influence both representation and generation quality. These findings illuminate the role of temporal information in diffusion-based representations and establish a baseline for future cross-architecture comparisons and scaling analyses.

Abstract

Diffusion models have revolutionized generative modeling, enabling unprecedented realism in image and video synthesis. This success has sparked interest in leveraging their representations for visual understanding tasks. While recent works have explored this potential for image generation, the visual understanding capabilities of video diffusion models remain largely uncharted. To address this gap, we systematically compare the same model architecture trained for video versus image generation, analyzing the performance of their latent representations on various downstream tasks including image classification, action recognition, depth estimation, and tracking. Results show that video diffusion models consistently outperform their image counterparts, though we find a striking range in the extent of this superiority. We further analyze features extracted from different layers and with varying noise levels, as well as the effect of model size and training budget on representation and generation quality. This work marks the first direct comparison of video and image diffusion objectives for visual understanding, offering insights into the role of temporal information in representation learning.

Paper Structure

This paper contains 45 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Video vs. image diffusion representations -- The diffusion model V-WALT trained on generating videos learns better features than the same model trained on generating images (I-WALT, normalized to 100% here) as measured across a range of readout tasks. See the image vs. video experiments section for details.
  • Figure 2: Probing architecture -- We feed videos through the model and extract (frozen) intermediate features. Cross-attention modules then read out the label for the downstream tasks.
  • Figure 3: Feature visualization -- We show the major PCA component for the two models across a range of DAVIS videos. While I-WALT is sensitive to semantically important areas of the scene (e.g., all people in the second column), V-WALT is much more sensitive to the areas that experience motion within the video (e.g., only the wrestlers in the same video).
  • Figure 4: Feature visualization for different motions -- In the 4 brick videos, only the marked portion (highlighted in red) is played, while the rest remains frozen. We visualize tokens from the first, identical frame. As an image model, I-WALT consistently produces the same feature, while V-WALT shows high sensitivity to moving areas, reflected in the major principal component.
  • Figure 5: Influence of Noise and Block Choice on Readout Performance -- Relative change in downstream task performance when probing different noise levels (left, fixed block $l=16$) and intermediate WALT blocks (right, fixed noise $t=200$). Values below -10% are excluded for clarity. Optimal performance is generally observed with noise between 0 and 200 and blocks 11-16. Example noisy images (left) and PCA visualizations (right) are shown below the plots.
  • ...and 7 more figures
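
The "major PCA component" visualizations mentioned in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. This is not the authors' code: the function name `major_pca_component`, the 8x8 token grid, the feature dimension, and the random features standing in for frozen WALT tokens are all assumptions made for illustration. The sketch projects each token's feature vector onto the first principal direction and reshapes the result into a spatial map.

```python
import numpy as np


def major_pca_component(tokens: np.ndarray) -> np.ndarray:
    """Project token features of shape (num_tokens, dim) onto their first principal component."""
    # PCA requires mean-centered data.
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[0]


# Hypothetical stand-in for frozen intermediate features of one frame:
# an 8x8 grid of tokens with 32 channels (sizes chosen only for this example).
rng = np.random.default_rng(0)
height, width, dim = 8, 8, 32
tokens = rng.normal(size=(height * width, dim))
tokens[:, 0] *= 10.0  # give one channel a dominant variance so the first component is clear

projection = major_pca_component(tokens)
heatmap = projection.reshape(height, width)  # one scalar per token, ready to plot as an image
```

With real features, `heatmap` would be upsampled and overlaid on the frame; regions where the value changes strongly are the ones the dominant feature direction responds to.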