Fine-gained Zero-shot Video Sampling

Dengsheng Chen, Jie Hu, Xiaoming Wei, Enhua Wu

TL;DR

This work introduces Zero-Shot video Sampling ($\mathcal{ZS}^2$), a training-free method for extracting high-quality, temporally coherent videos from pretrained image diffusion models. It combines a dependency noise model, which imposes KL-divergence-guided correlations across per-frame noise, with temporal momentum attention, which blends self- and cross-frame attention to regulate motion while preserving appearance. A two-stage noise search algorithm ensures valid Gaussian noise sequences for long video clips and integrates with DDIM sampling, enabling broad compatibility and low overhead. The approach shows strong zero-shot performance, competitive with supervised methods, and supports conditional, context-specialized, and instruction-guided video generation without additional training; because the image model's weights are never updated, it avoids the catastrophic forgetting of image priors that fine-tuning on video data can cause. Post-processing with spatio-temporal super-resolution further improves quality, making $\mathcal{ZS}^2$ a practical route to text-to-video generation from image diffusion priors.
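To make the idea of correlated per-frame noise concrete, the following is a minimal sketch, not the paper's dependency noise model or its two-stage search. It assumes a simple first-order mixing rule in which each frame's noise is a weighted sum of the previous frame's noise and fresh Gaussian noise, with weights chosen so that every frame stays marginally standard normal. The function name `correlated_frame_noise` and the parameter `lam` are hypothetical and only loosely mirror the $\lambda_i$ mentioned in the figure captions.

```python
import math
import random


def correlated_frame_noise(num_frames, dim, lam, seed=0):
    """Return `num_frames` noise vectors of length `dim`.

    Assumption (illustrative only): frame i is built as
        eps_i = lam * eps_{i-1} + sqrt(1 - lam^2) * fresh_noise,
    so each frame is marginally N(0, I) and adjacent frames have
    correlation `lam`. This is NOT the paper's KL-guided search.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    rng = random.Random(seed)
    fresh_weight = math.sqrt(1.0 - lam * lam)

    # The first frame is plain standard Gaussian noise.
    frames = [[rng.gauss(0.0, 1.0) for _ in range(dim)]]
    for _ in range(num_frames - 1):
        prev = frames[-1]
        # lam^2 + (1 - lam^2) = 1 keeps the variance at exactly one.
        frames.append(
            [lam * p + fresh_weight * rng.gauss(0.0, 1.0) for p in prev]
        )
    return frames


if __name__ == "__main__":
    noise = correlated_frame_noise(num_frames=4, dim=10000, lam=0.9)
    variance = sum(x * x for x in noise[-1]) / len(noise[-1])
    print(f"variance of last frame: {variance:.3f}")
```

Under this assumption, `lam = 1` repeats the same noise in every frame (a static video) and `lam = 0` gives independent frames (no shared content), which is the trade-off a dependency noise model has to manage.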

Abstract

Incorporating a temporal dimension into pretrained image diffusion models for video generation is a prevalent approach. However, this approach is computationally demanding and requires large-scale video datasets. More critically, the heterogeneity between image and video datasets often results in catastrophic forgetting of the image expertise. Recent attempts to directly extract video snippets from image diffusion models have somewhat mitigated these problems. Nevertheless, these methods can only generate brief video clips with simple movements and fail to capture fine-grained motion or non-rigid deformation. In this paper, we propose a novel Zero-Shot video Sampling algorithm, denoted as $\mathcal{ZS}^2$, capable of directly sampling high-quality video clips from existing image synthesis methods, such as Stable Diffusion, without any training or optimization. Specifically, $\mathcal{ZS}^2$ uses the dependency noise model and temporal momentum attention to ensure content consistency and animation coherence, respectively. This ability enables it to excel in related tasks, such as conditional and context-specialized video generation and instruction-guided video editing. Experimental results demonstrate that $\mathcal{ZS}^2$ achieves state-of-the-art performance in zero-shot video generation, occasionally outperforming recent supervised methods. Homepage: https://densechen.github.io/zss/
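As a rough illustration of how self-attention and cross-frame attention could be blended, here is a minimal pure-Python sketch. It is an assumption-laden toy, not the paper's temporal momentum attention: it blends the two attention outputs linearly with a single weight, whereas the actual method may combine keys, values, or attention maps differently. The names `momentum_attention` and `mu` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax(
            [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        )
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


def momentum_attention(q, k_self, v_self, k_ref, v_ref, mu):
    """Blend self-attention with attention to a reference frame.

    Assumption (illustrative only): the result is
        mu * cross_frame_output + (1 - mu) * self_output.
    mu = 0 is pure self-attention; mu = 1 attends only to the
    reference frame's keys and values.
    """
    self_out = attention(q, k_self, v_self)
    cross_out = attention(q, k_ref, v_ref)
    return [
        [mu * c + (1.0 - mu) * s for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(self_out, cross_out)
    ]


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0]]
    k_cur, v_cur = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]
    k_ref, v_ref = [[0.5, 0.5], [1.0, -1.0]], [[0.0, 1.0], [2.0, 0.0]]
    print(momentum_attention(q, k_cur, v_cur, k_ref, v_ref, mu=0.5))
```

In this toy form, a larger `mu` pulls each frame toward the reference frame's appearance, while a smaller `mu` leaves more room for per-frame variation.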

Paper Structure

This paper contains 40 sections, 10 equations, 17 figures, 1 table, 1 algorithm.

Figures (17)

  • Figure 1: Our method is capable of sampling more detailed and semantically rich motion variations.
  • Figure 2: Our method works well across different image diffusion models.
  • Figure 3: Comparison with the baseline, Text2Video-Zero. (Both sampled from Dreamlike Photoreal v2.0.)
  • Figure 4: The motion is regulated by $\lambda_i$ and $\mu_i$. We present several video samples from the pose guidance task. The first and second rows show that different values of $\lambda_i$ and $\mu_i$ effectively control the variations in video content. (Best viewed on our homepage.)
  • Figure 5: Post-processing a sampled video clip (left) with a temporal super-resolution model (middle), followed by a spatial super-resolution model (right). Prompt: An unstable rock cairn in the middle of a stream. (Best viewed on our homepage.)
  • ...and 12 more figures