JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation
Hadrien Reynaud, Matthew Baugh, Mischa Dombrowski, Sarah Cechnicka, Qingjie Meng, Bernhard Kainz
TL;DR
This work tackles the challenge of producing high-quality, temporally coherent videos with diffusion models by proposing JVID, which jointly samples from a Latent Image Diffusion Model (LIDM) and a Latent Video Diffusion Model (LVDM) that are trained independently but under a shared perturbation framework. The key innovation is a mixture-of-denoising-models sampling strategy that selects between the image and video denoisers at each reverse-diffusion step to balance image fidelity and temporal consistency, augmented by inference-time entropy reduction and temporal latent smoothing. Trained on UCF-101 and evaluated across multiple resolutions, JVID improves coherence and visual quality over several baselines, while latent-space modeling keeps its computational cost relatively low. The work also provides an ablation study and releases pretrained models, highlighting the practical potential of combining independently trained diffusion models for flexible, conditioned video generation at scale, although state-of-the-art results still require significant compute.
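The per-step choice between the two denoisers can be illustrated with a toy sampler. This is a minimal sketch and not the authors' implementation. It rests on several assumptions: `image_denoiser` and `video_denoiser` are hypothetical analytic stand-ins for the trained LIDM and LVDM, the choice is drawn at random with a hypothetical probability `p_video`, the update is a plain Euler step over a noise schedule `sigmas`, and the entropy-reduction and latent-smoothing steps are left out.

```python
import numpy as np

def image_denoiser(x, sigma):
    # Hypothetical stand-in for the LIDM: every frame is denoised on its own.
    return x / (1.0 + sigma**2)

def video_denoiser(x, sigma):
    # Hypothetical stand-in for the LVDM: frames are also pulled toward the
    # mean over the time axis (axis 0), mimicking temporal coupling.
    frame_mean = x.mean(axis=0, keepdims=True)
    return (0.5 * x + 0.5 * frame_mean) / (1.0 + sigma**2)

def sample(shape, sigmas, p_video, seed=0):
    """Toy reverse process that picks one denoiser per step (assumed scheme)."""
    rng = np.random.default_rng(seed)
    # Start from pure noise at the highest noise level.
    x = rng.standard_normal(shape) * sigmas[0]
    # Pre-draw the choice for each step: True selects the video denoiser.
    use_video = rng.random(len(sigmas) - 1) < p_video
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        denoiser = video_denoiser if use_video[i] else image_denoiser
        denoised = denoiser(x, sigma)
        # Euler step from noise level sigma to sigma_next.
        direction = (x - denoised) / sigma
        x = x + (sigma_next - sigma) * direction
    return x, use_video

# Latent video of 8 frames, 4 channels, 16x16 spatial size (arbitrary sizes).
sigmas = np.linspace(10.0, 0.0, 21)
latents, choices = sample((8, 4, 16, 16), sigmas, p_video=0.5)
```

In this sketch, setting `p_video` to 0 or 1 recovers pure image or pure video sampling, and intermediate values trade per-frame fidelity against coupling across frames. Both denoisers can be swapped at any step only because they take the same latent shape and noise level as input.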
Abstract
We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality and temporally coherent videos. We achieve this by integrating two diffusion models: a Latent Image Diffusion Model (LIDM) trained on images and a Latent Video Diffusion Model (LVDM) trained on video data. Our method combines these models in the reverse diffusion process, where the LIDM enhances image quality and the LVDM ensures temporal consistency. This unique combination allows us to effectively handle the complex spatio-temporal dynamics in video generation. Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.
