Conditional Image-to-Video Generation with Latent Flow Diffusion Models

Haomiao Ni; Changhao Shi; Kai Li; Sharon X. Huang; Martin Renqiang Min

TL;DR

The paper tackles conditional image-to-video (cI2V) generation with latent flow diffusion models (LFDM), which generate a temporally coherent latent flow sequence from the input image and a condition (e.g., an action class label) and use it to warp the image in latent space. Training has two stages: Stage 1 learns a latent flow auto-encoder that models spatial content and occlusions, and Stage 2 trains a 3D U-Net diffusion model to produce the latent flow sequence conditioned on the input image and the condition. At inference, every frame is synthesized by warping the latent map of the starting image with the generated flow, which avoids errors accumulating from frame to frame. Experiments on MUG, MHAD, and NATOPS show that LFDM outperforms the baselines in both conditional and stochastic settings, and that it adapts to new domains by finetuning only the image decoder. Because the diffusion model only has to learn a low-dimensional latent flow space, the approach is also more computationally efficient than diffusion models that operate in pixel space or in a latent feature space that couples spatial and temporal information.

Abstract

Conditional image-to-video (cI2V) generation aims to synthesize a new plausible video starting from an image (e.g., a person's face) and a condition (e.g., an action class label such as "smile"). The key challenge of the cI2V task lies in the simultaneous generation of realistic spatial appearance and temporal dynamics corresponding to the given image and condition. In this paper, we propose an approach for cI2V using novel latent flow diffusion models (LFDM) that synthesize an optical flow sequence in the latent space based on the given condition to warp the given image. Compared to previous direct-synthesis-based works, our proposed LFDM can better synthesize spatial details and temporal motion by fully utilizing the spatial content of the given image and warping it in the latent space according to the generated temporally-coherent flow. The training of LFDM consists of two separate stages: (1) an unsupervised learning stage to train a latent flow auto-encoder for spatial content generation, including a flow predictor to estimate latent flow between pairs of video frames, and (2) a conditional learning stage to train a 3D-UNet-based diffusion model (DM) for temporal latent flow generation. Unlike previous DMs operating in pixel space or in a latent feature space that couples spatial and temporal information, the DM in our LFDM only needs to learn a low-dimensional latent flow space for motion generation, thus being more computationally efficient. We conduct comprehensive experiments on multiple datasets, where LFDM consistently outperforms prior art. Furthermore, we show that LFDM can be easily adapted to new domains by simply finetuning the image decoder. Our code is available at https://github.com/nihaomiao/CVPR23_LFDM.
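
The latent-space warping described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `warp_latent`, the flow channel order (x then y), bilinear sampling, border clamping, and masking by element-wise multiplication with the occlusion map are all assumptions made for illustration.

```python
import numpy as np

def warp_latent(z0, flow, occlusion):
    """Backward-warp latent map z0 (C, H, W) with flow (2, H, W), then mask.

    Assumed convention: flow[0] / flow[1] hold x / y displacements in latent
    pixels, so output position (y, x) samples z0 at (y + flow[1], x + flow[0])
    with bilinear interpolation. occlusion (H, W) lies in [0, 1]; larger
    values mean "less likely to be occluded".
    """
    c, h, w = z0.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling coordinates, clamped to the border of the latent map.
    sx = np.clip(xs + flow[0], 0, w - 1)
    sy = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = sx - x0
    wy = sy - y0
    # Bilinear interpolation between the four neighbouring latent pixels.
    top = z0[:, y0, x0] * (1 - wx) + z0[:, y0, x1] * wx
    bottom = z0[:, y1, x0] * (1 - wx) + z0[:, y1, x1] * wx
    warped = top * (1 - wy) + bottom * wy
    # Suppress regions that the occlusion map marks as occluded.
    return warped * occlusion

# Hypothetical inputs: a 4-channel 8x8 latent map and K = 3 generated flows.
rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 8, 8))
flows = np.zeros((3, 2, 8, 8))
flows[1, 0] = 1.0              # second frame samples one pixel to the right
masks = np.ones((3, 8, 8))
masks[2] = 0.0                 # third frame: everything marked occluded
latents = [warp_latent(z0, f, m) for f, m in zip(flows, masks)]
```

Note that every frame is produced by warping `z0` itself rather than the previous frame, which mirrors the idea of using backward flow to the starting image; each entry of `latents` would then be passed to an image decoder.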

Paper Structure (15 sections, 7 equations, 5 figures, 6 tables)

Introduction
Related Work
Image-to-Video Generation
Diffusion Models for Video Generation
Our Method
Diffusion Models
Training
Stage One: Latent Flow Auto-Encoder
Stage Two: Diffusion Model
Inference
Experiments
Datasets and Metrics
Model and Baseline Implementation
Result Analysis
Conclusion and Discussion

Figures (5)

Figure 1: Examples of generated video frames and latent flow sequences using our proposed LFDM. The first column shows the given images $x_0$ and conditions $y$. The latent flow maps are backward optical flow to $x_0$ in the latent space. Flow is visualized with the color coding scheme of Baker et al. (2011), in which color indicates the direction and magnitude of the flow.
Figure 2: The video generation (i.e., inference) process of LFDM. The generated flow sequence $\hat{\mathbf{f}}_1^K$ and occlusion map sequence $\hat{\mathbf{m}}_1^K$ have the same spatial size as the image latent map $z_0$. Brighter regions in $\hat{\mathbf{m}}_1^K$ are less likely to be occluded.
Figure 3: The training framework of LFDM. On the left is stage one, which trains the latent flow auto-encoder; on the right is stage two, which trains the diffusion model. In stage two, the encoder $\Phi$ is the one already trained in stage one, and the latent flow sequence $\mathbf{f}^K_1$ and occlusion map sequence $\mathbf{m}^K_1$ are estimated between $x_0$ and each frame in the ground truth video $\mathbf{x}^K_1$ using the trained flow predictor $F$ from stage one.
Figure 4: Qualitative comparison among different methods on multiple datasets for cI2V generation. The first image frame $x_0$ is highlighted with a red box, and the condition $y$ is shown under each block. To simplify coding, all the models are designed to also generate the starting frame $\hat{x}_0$. The video frames of GT (ground truth), LDM, and LFDM have $128\times 128$ resolution, while the results of ImaGINator and VDM are $64\times 64$.
Figure 5: Qualitative comparison of LFDM with the original decoder (1st and 3rd rows) vs. the finetuned decoder (2nd and 4th rows) on the FaceForensics dataset (Rössler et al., 2018). The first column shows the given image $x_0$ and condition $y$. The green boxes highlight differences.
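
The flow visualizations in Figure 1 map direction and magnitude to color. The sketch below shows the general idea with a simple HSV mapping; it is a simplification, not the exact color wheel used in the paper, and the function name `flow_to_rgb` and the `max_magnitude` normalization are assumptions.

```python
import colorsys
import math

def flow_to_rgb(u, v, max_magnitude):
    """Map one flow vector (u, v) to an RGB triple with components in [0, 1].

    Hue encodes the flow direction and saturation encodes the magnitude
    (normalized by max_magnitude, which must be positive), so zero flow
    is drawn as white.
    """
    magnitude = min(math.hypot(u, v) / max_magnitude, 1.0)
    angle = math.atan2(v, u)                         # in [-pi, pi]
    hue = (angle % (2 * math.pi)) / (2 * math.pi)    # wrapped to [0, 1]
    return colorsys.hsv_to_rgb(hue, magnitude, 1.0)

still = flow_to_rgb(0.0, 0.0, max_magnitude=1.0)     # no motion
right = flow_to_rgb(1.0, 0.0, max_magnitude=1.0)     # full-strength +x motion
left = flow_to_rgb(-1.0, 0.0, max_magnitude=1.0)     # full-strength -x motion
```

Applying this per latent pixel gives an image in which static regions are white and opposite motion directions get clearly different hues.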
