VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide

Dohun Lee, Bryan S Kim, Geon Yeong Park, Jong Chul Ye

TL;DR

VideoGuide targets the trade-off between temporal consistency and image fidelity in text-to-video diffusion. During the early reverse-diffusion steps, a pretrained guiding (teacher) VDM steers the sampling (student) model: the guide's denoised samples are interpolated into the student's denoising process and optionally low-pass filtered, with no training or fine-tuning. The guide can be an external model or the student itself. The method also exhibits a prior-distillation effect, in which a guide with a better data prior improves the student's text coherence. Across multiple T2V backbones and open-source VDMs, VideoGuide reports substantial gains in temporal coherence and motion smoothness with minimal loss of imaging quality and notable speedups, making it a practical inference-time addition to existing diffusion pipelines that leaves each model's own capabilities intact.

Abstract

Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often introduce trade-offs such as reduced imaging quality and impractical computational time. To address these issues, we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without additional training or fine-tuning. Instead, VideoGuide uses any pretrained video diffusion model (VDM), or the base model itself, as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method significantly improves temporal consistency and image fidelity, providing a cost-effective and practical solution that combines the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation: base models can achieve enhanced text coherence by drawing on the superior data prior of the guiding model through the proposed method. Project Page: https://dohunlee1.github.io/videoguide.github.io/
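
The interpolation of denoised samples described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the blending weight `beta`, the tensor layout, and the use of the standard DDPM/DDIM clean-latent estimate are all assumptions made for illustration, and the guiding model's output is replaced by a placeholder array.

```python
import numpy as np


def denoised_estimate(z_t, eps_pred, alpha_bar_t):
    """Clean-latent estimate under the usual DDPM/DDIM parameterization:
    z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)


def interpolate_denoised(z0_sampling, z0_guiding, beta):
    """Blend two clean-latent estimates.

    beta is a hypothetical weight in [0, 1] kept on the sampling model;
    (1 - beta) goes to the guiding model.
    """
    return beta * z0_sampling + (1.0 - beta) * z0_guiding


# Toy demonstration with random tensors standing in for video latents.
rng = np.random.default_rng(0)
shape = (4, 16, 8, 8)  # (channels, frames, height, width), illustrative only
z0_true = rng.standard_normal(shape)
eps = rng.standard_normal(shape)
alpha_bar_t = 0.3

# Forward-noise the clean latent, then recover it with a perfect noise prediction.
z_t = np.sqrt(alpha_bar_t) * z0_true + np.sqrt(1.0 - alpha_bar_t) * eps
z0_sampling = denoised_estimate(z_t, eps, alpha_bar_t)

# Placeholder for the guiding model's denoised sample (assumption: all zeros).
z0_guiding = np.zeros(shape)

fused = interpolate_denoised(z0_sampling, z0_guiding, beta=0.5)
```

In a real pipeline the two estimates would come from the sampling and guiding VDMs, and the blend would only be applied during the first few inference steps, as the abstract states.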

Paper Structure (28 sections, 18 equations, 19 figures, 7 tables)

Figures (19)

  • Figure 1: VideoGuide is a novel framework for improving temporal consistency while preserving imaging quality, enabling high-quality video generation for diverse text prompts. By applying VideoGuide to underperforming base models, we can significantly improve temporal consistency with no additional training or fine-tuning.
  • Figure 2: Overall Pipeline. VideoGuide is a framework for enhancing temporal quality without additional training, leveraging the capabilities of any pretrained VDM. Throughout the denoising process of the sampling VDM, the guiding VDM receives an intermediate latent ${\boldsymbol z}_t$ and provides a temporally consistent sample ${\boldsymbol z}_{t-\tau}$ by proceeding in its own denoising for a small number of steps $\tau$. The sample ${\boldsymbol z}_{t-\tau}$ is then denoised and interpolated with the denoised ${\boldsymbol z}_t$ to produce a fused latent ${\boldsymbol z}'_{t}$. Such interpolation only needs to take part in the first few steps of inference, and effectively guides samples towards a direction of improved temporal consistency. To further ensure model flexibility in refining high-frequency areas for better image quality, the latent ${\boldsymbol z}'_{t}$ is passed through a Low-Pass Filter (LPF). Overall, VideoGuide is a straightforward addition to the original pipeline, yet it is powerful enough to significantly enhance temporal consistency without compromising imaging quality or motion smoothness.
  • Figure 3: Qualitative Comparison. VideoGuide is applied on various base models for different text prompts. For each prompt, frames of generated samples from four different models are displayed: (i) Base model (first row); (ii) Base model with FreeInit (second row); (iii) Base model with VideoGuide (self-guided case) (third row); (iv) Base model with VideoGuide (external model-guided case) (fourth row). AD, VC, LV indicate guidance models of AnimateDiff, VideoCrafter-2.0, LaVie, respectively. Samples for the base model show substandard temporal consistency, especially regarding color fluctuation and subject appearance change. Applying FreeInit improves consistency but introduces degradation in imaging quality, such as smoothing out of textural details. In contrast, applying VideoGuide significantly enhances temporal consistency while preserving imaging quality, both for the self-guided and the external model-guided case.
  • Figure 4: Prior Distillation Results. VideoGuide remedies degraded text coherence by letting the base model draw on a superior data prior. Example results for ambiguous prompts are displayed. For each prompt, the same random seed is shared by both methods. Because of its weaker data prior, AnimateDiff steers generation for 'beetle' and 'jaguar' towards car samples. With VideoGuide, users can distill a superior prior to obtain the correct generation.
  • Figure 5: Comparison of Subject Consistency, Background Consistency, and Imaging Quality across interpolation steps ($I$) with and without the application of the low-frequency filter. Results indicate that the low-frequency filter accelerates convergence towards consistency while maintaining imaging quality.
  • ...and 14 more figures
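
The Low-Pass Filter (LPF) step mentioned in the Figure 2 and Figure 5 captions can be illustrated with a generic frequency-domain filter. The sketch below is an assumption-laden example, not the paper's filter: the Gaussian mask, the `cutoff` parameter, the choice to filter jointly over the frame, height, and width axes, and the function name are all hypothetical, since the captions do not specify these details.

```python
import numpy as np


def gaussian_low_pass(latent, cutoff=0.25):
    """Attenuate high spatio-temporal frequencies of a video latent.

    latent: array of shape (channels, frames, height, width).
    cutoff: hypothetical Gaussian width in normalized frequency
            (cycles per sample, so the Nyquist frequency is 0.5).
    """
    axes = (-3, -2, -1)  # frames, height, width
    # Normalized frequency grid along each filtered axis.
    grids = np.meshgrid(
        *[np.fft.fftfreq(latent.shape[a]) for a in axes], indexing="ij"
    )
    radius_sq = sum(g ** 2 for g in grids)
    # The mask equals 1 at zero frequency and decays with distance from it.
    mask = np.exp(-radius_sq / (2.0 * cutoff ** 2))
    spectrum = np.fft.fftn(latent, axes=axes)
    # The mask is symmetric in frequency, so a real input yields a real output.
    return np.fft.ifftn(spectrum * mask, axes=axes).real


# A constant latent contains only the zero frequency and passes through unchanged.
flat = np.ones((2, 4, 8, 8))
flat_filtered = gaussian_low_pass(flat)

# A pattern alternating along the width axis sits at the Nyquist frequency
# and is strongly attenuated.
stripes = np.ones((2, 4, 8, 8)) * np.where(np.arange(8) % 2 == 0, 1.0, -1.0)
stripes_filtered = gaussian_low_pass(stripes)
```

A smaller `cutoff` removes more high-frequency content; how the filtered latent is recombined with the rest of the denoising process is left out here because the captions above do not describe it.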