VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
Dohun Lee, Bryan S Kim, Geon Yeong Park, Jong Chul Ye
TL;DR
VideoGuide targets the trade-off between temporal consistency and image fidelity in text-to-video (T2V) diffusion. It guides the early reverse-diffusion steps of a student model with a pretrained teacher video diffusion model (VDM), using sample interpolation and optional filtering, without any model training. The approach supports external guiding models and reveals a prior-distillation effect: a guiding model with a better data prior improves the student's text coherence. Across multiple T2V backbones and open-source VDMs, VideoGuide substantially improves temporal coherence and motion smoothness while preserving image fidelity and keeping inference cost practical. It thus serves as a training-free, inference-time upgrade for existing video diffusion pipelines that preserves each model's own capabilities.
Abstract
Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often introduce trade-offs such as reduced imaging quality and impractical computation time. To address these issues, we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without additional training or fine-tuning. Instead, VideoGuide uses any pretrained video diffusion model (VDM), or the sampling model itself, as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process. The proposed method significantly improves temporal consistency and image fidelity, providing a cost-effective and practical solution that combines the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation: base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page: https://dohunlee1.github.io/videoguide.github.io/
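
The interpolation idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the blend weight `beta`, the `guide_steps` cutoff, the deterministic DDIM-style update, and the single-call teacher are all assumptions made for illustration, and the "denoisers" are placeholder callables rather than real video diffusion models.

```python
import numpy as np

def interpolate_denoised(x0_student, x0_teacher, beta):
    """Blend two denoised estimates; beta=1.0 leaves the student unchanged.

    `beta` is a hypothetical interpolation weight, not a value from the paper.
    """
    return beta * x0_student + (1.0 - beta) * x0_teacher

def guided_sampling(x_T, student_denoise, teacher_denoise,
                    alphas_bar, guide_steps, beta):
    """Deterministic DDIM-style sampling with teacher guidance in early steps.

    alphas_bar: cumulative signal levels ordered from noisy to clean,
                each strictly between 0 and 1 (assumed schedule).
    guide_steps: number of initial steps that use the teacher (assumed knob).
    """
    x = x_T
    n = len(alphas_bar)
    for i in range(n):
        a_t = alphas_bar[i]
        a_prev = alphas_bar[i + 1] if i + 1 < n else 1.0  # final step is clean
        x0 = student_denoise(x, a_t)
        if i < guide_steps:
            # Early steps only: pull the student's estimate toward the teacher's.
            x0 = interpolate_denoised(x0, teacher_denoise(x, a_t), beta)
        # Noise implied by the current sample and the (possibly blended) estimate.
        eps = (x - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x

# Toy usage: constant "denoisers" standing in for real models.
rng = np.random.default_rng(0)
x_T = rng.standard_normal((4, 8, 8))          # (frames, height, width)
student = lambda x, a: np.zeros_like(x)       # placeholder student
teacher = lambda x, a: np.ones_like(x)        # placeholder teacher
schedule = [0.1, 0.4, 0.7, 0.95]
video = guided_sampling(x_T, student, teacher, schedule, guide_steps=2, beta=0.5)
```

Because guidance is applied only while `i < guide_steps`, later steps are driven entirely by the student, which mirrors the abstract's statement that the guide acts during the early stages of inference.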
