JVID: Joint Video-Image Diffusion for Visual-Quality and Temporal-Consistency in Video Generation

Hadrien Reynaud; Matthew Baugh; Mischa Dombrowski; Sarah Cechnicka; Qingjie Meng; Bernhard Kainz

TL;DR

This work tackles the challenge of producing high-quality, temporally coherent videos with diffusion models. It proposes JVID, which jointly samples from a Latent Image Diffusion Model (LIDM) and a Latent Video Diffusion Model (LVDM) that are trained independently but under a shared perturbation framework. The key innovation is a mixture-of-denoising-models sampling strategy that selects either the image denoiser or the video denoiser at each reverse-diffusion step to balance image fidelity and temporal consistency, augmented by inference-time entropy reduction and temporal latent smoothing. Trained on UCF-101 and evaluated at multiple resolutions, JVID improves temporal coherence and visual quality over several baselines, while latent-space modeling keeps its computational cost comparatively low. The work also provides an ablation study and releases pretrained models, highlighting the practical potential of combining independently trained diffusion models for flexible, conditioned video generation at scale, although state-of-the-art results still require significant compute.

Abstract

We introduce the Joint Video-Image Diffusion model (JVID), a novel approach to generating high-quality and temporally coherent videos. We achieve this by integrating two diffusion models: a Latent Image Diffusion Model (LIDM) trained on images and a Latent Video Diffusion Model (LVDM) trained on video data. Our method combines these models in the reverse diffusion process, where the LIDM enhances image quality and the LVDM ensures temporal consistency. This unique combination allows us to effectively handle the complex spatio-temporal dynamics in video generation. Our results demonstrate quantitative and qualitative improvements in producing realistic and coherent videos.

Paper Structure (18 sections, 8 equations, 4 figures, 6 tables)

This paper contains 18 sections, 8 equations, 4 figures, 6 tables.

Introduction
Related Works
Diffusion models
Diffusion-based Image Models
Diffusion-based Video Models
Methods
Denoising Diffusion Probabilistic Models
Latent Image Diffusion Models
Latent Video Diffusion Models
Mixture of Denoising Models
Inference Entropy Reduction
Temporal Latent Smoothing
Experiments
Model training
Models evaluation
...and 3 more sections

Figures (4)

Figure 1: Video samples generated with our JVID model, combining both an image and a video diffusion model during sampling, to produce high quality and temporally coherent videos. Rows 1-4 are generated at $128 \times 128$, while rows 5-8 are $64 \times 64$.
Figure 2: Our proposed mixture-of-denoising-models sampling approach. At each step, we select one model, which is used to denoise the noisy sample $\bm{s}_t$. Once the sample is fully denoised, it is decoded with the VAE decoder to reconstruct the generated video frames.
Figure 3: Model selection probability function. Parameters $t_v$, $t_e$, $p_e$, $p_f$ are determined empirically.
Figure 4: Our JVID model (top two rows) demonstrates higher image quality and temporal consistency than StyleGAN-V [skorokhodov2021stylegan] (bottom two rows).
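
The sampling procedure described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. It only shows the control flow: at each reverse-diffusion step one of two denoisers is picked at random and applied to the current latent. Everything specific is an assumption made for illustration. The piecewise-linear shape of `video_model_probability`, the roles assigned to $t_v$, $t_e$, $p_e$ and $p_f$, the toy denoisers, and the latent shape are not taken from the paper, whose actual selection curve is given in Figure 3. VAE decoding, entropy reduction and temporal latent smoothing are omitted.

```python
import numpy as np


def video_model_probability(t, t_v, t_e, p_e, p_f):
    """Probability of picking the video denoiser at step t.

    ASSUMPTION: a hypothetical piecewise-linear schedule, not the curve
    from Figure 3. Here t_v > t_e > 0 are step thresholds and p_e, p_f
    are probabilities reached at t = t_e and t = 0 respectively.
    """
    if t >= t_v:
        # High-noise steps: always use the video model.
        return 1.0
    if t >= t_e:
        # Interpolate from p_e (at t_e) up to 1.0 (at t_v).
        return p_e + (1.0 - p_e) * (t - t_e) / (t_v - t_e)
    # Interpolate from p_f (at t = 0) up to p_e (at t_e).
    return p_f + (p_e - p_f) * t / t_e


def toy_video_denoiser(s, t):
    # Placeholder for an LVDM step: pulls frames toward their temporal mean.
    return 0.9 * s + 0.1 * s.mean(axis=0, keepdims=True)


def toy_image_denoiser(s, t):
    # Placeholder for an LIDM step: acts on every frame independently.
    return 0.9 * s


def mixture_sample(video_denoiser, image_denoiser, shape, num_steps, rng,
                   t_v, t_e, p_e, p_f):
    """Run num_steps reverse steps, choosing one denoiser per step."""
    s = rng.standard_normal(shape)  # initial noise latent s_T
    choices = []
    for t in range(num_steps, 0, -1):
        p_video = video_model_probability(t, t_v, t_e, p_e, p_f)
        use_video = rng.random() < p_video
        denoiser = video_denoiser if use_video else image_denoiser
        s = denoiser(s, t)
        choices.append("video" if use_video else "image")
    # In the real pipeline the final latent would go through a VAE decoder.
    return s, choices


# Example call with made-up settings: 8 frames of 4x16x16 latents, 50 steps.
rng = np.random.default_rng(0)
latent, picks = mixture_sample(
    toy_video_denoiser, toy_image_denoiser, (8, 4, 16, 16), 50, rng,
    t_v=40, t_e=10, p_e=0.5, p_f=0.2,
)
```

Because both denoisers act on the same latent and are indexed by the same step `t`, they can be swapped freely inside the loop. This mirrors the requirement stated in the TL;DR that the two models share a perturbation framework.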
