MoVideo: Motion-Aware Video Generation with Diffusion Models

Jingyun Liang, Yuchen Fan, Kai Zhang, Radu Timofte, Luc Van Gool, Rakesh Ranjan

TL;DR

MoVideo addresses the challenge of producing temporally coherent videos by explicitly modeling motion through video depth and optical flow. It introduces a four-stage diffusion-based framework that first generates depth and flows from a key frame, then performs latent-space video generation guided by depth, warped features, and occlusions, followed by flow-aware decoding for pixel-space reconstruction. The method leverages cross-attention conditioning on a key-frame embedding and fps to control content and motion, and employs a 3D UNet for spatio-temporal modeling, plus flow-guided alignment during decoding. Experiments demonstrate state-of-the-art performance in both text-to-video and image-to-video tasks, with strong prompt and frame consistency and high visual quality, highlighting MoVideo’s potential for open-domain video synthesis with controllable motion.

Abstract

While recent years have witnessed great progress on using diffusion models for video generation, most of them are simple extensions of image generation frameworks, which fail to explicitly consider one of the key differences between videos and images, i.e., motion. In this paper, we propose a novel motion-aware video generation (MoVideo) framework that takes motion into consideration from two aspects: video depth and optical flow. The former regulates motion by per-frame object distances and spatial layouts, while the latter describes motion by cross-frame correspondences that help in preserving fine details and improving temporal consistency. More specifically, given a key frame that either exists or is generated from text prompts, we first design a diffusion model with spatio-temporal modules to generate the video depth and the corresponding optical flows. Then, the video is generated in the latent space by another spatio-temporal diffusion model under the guidance of depth, the optical flow-based warped latent video, and the calculated occlusion mask. Lastly, we use optical flows again to align and refine different frames for better video decoding from the latent space to the pixel space. In experiments, MoVideo achieves state-of-the-art results in both text-to-video and image-to-video generation, showing promising prompt consistency, frame consistency, and visual quality.
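
The abstract mentions a flow-warped latent video and a calculated occlusion mask without saying how either is computed. The following is a minimal NumPy sketch of one common way to obtain both, not the authors' implementation: bilinear backward warping, plus a forward-backward consistency check for occlusion. The function names, the flow convention (each target pixel stores the displacement to its source location), and the `threshold` parameter are all assumptions made for illustration.

```python
import numpy as np


def backward_warp(source, flow):
    """Warp `source` (H, W, C) to a target frame with bilinear sampling.

    Assumed convention: flow[y, x] = (dx, dy) points from target pixel
    (x, y) to its corresponding location in `source`.
    """
    h, w = source.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling coordinates in the source, clamped to the image border.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional offsets, with a trailing axis so they broadcast over channels.
    wx = (sx - x0)[..., None]
    wy = (sy - y0)[..., None]
    top = (1 - wx) * source[y0, x0] + wx * source[y0, x1]
    bottom = (1 - wx) * source[y1, x0] + wx * source[y1, x1]
    return (1 - wy) * top + wy * bottom


def occlusion_mask(flow_t2s, flow_s2t, threshold=1.0):
    """Return a boolean (H, W) mask, True where a target pixel looks occluded.

    Forward-backward check (an assumed heuristic): following flow_t2s and
    then flow_s2t should lead back to the starting pixel, so a large
    round-trip error is treated as occlusion.
    """
    # Sample the source-to-target flow at the locations flow_t2s points to.
    flow_back = backward_warp(flow_s2t, flow_t2s)
    round_trip = flow_t2s + flow_back
    return np.linalg.norm(round_trip, axis=-1) > threshold


# Usage: every target pixel looks one pixel to the right in the key latent.
key_latent = np.arange(16, dtype=float).reshape(4, 4, 1)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
warped = backward_warp(key_latent, flow)
mask = occlusion_mask(flow, -flow)  # consistent flows, so nothing is flagged
```

In this sketch the warped key-frame latent and the mask would be passed to the video diffusion model as extra conditioning; regions flagged as occluded are the ones where the warped content cannot be trusted and has to be generated.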

Paper Structure (19 sections, 11 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: The basic spatio-temporal block for building the 3D denoising UNet. We add temporal modules, including temporal convolution and temporal attention layers, after the spatial convolution and spatial attention layers. The ${fps}$ is encoded by a multi-layer perceptron and then added to the feature after 2D convolution, while the key frame $x_{key}$ is encoded by the image encoder of an open-source pretrained image-text bi-encoder model and then injected into the 2D spatial attention layer by cross-attention.
  • Figure 2: Comparison of different architectures for text-to-video generation without text-video training pairs. As the top route shows, some methods [singer2022make, esser2023structure] first encode the text with the text encoder of an open-source pretrained image-text bi-encoder model and then use a text-to-image prior [ramesh2021zero, ramesh2022hierarchical] to transform it into the pooled image embedding, which is used as the condition to guide video generation. Instead, we propose to first generate an image with a public text-to-image latent diffusion model and extract its unpooled image embedding, which preserves the spatial layout and local details of the image. Based on this embedding, we generate the depth and optical flow of the video and then use them to guide video generation.
  • Figure 3: The basic block for building the optical flow-augmented video decoding model. We add temporal convolution layers after spatial convolution layers to extract spatio-temporal video features. After that, with the optical flow ${o^{i2v}_f}$, we align the key frame $z_{key}$ towards each frame as $\widetilde{z_f}$, which is concatenated with $z_f$ for feature refinement.
  • Figure 4: Visual comparison on text-to-video generation. For each example, the first row is from our method, while the second row is from VideoDiffusion [yu2023video]. More visual comparisons, including video results, are provided in the supplementary.
  • Figure 5: Visual comparison on image-to-video generation. The first three rows are guided by the generated depth and optical flows, while the remaining rows are guided by the ground-truth (GT) ones. More visual comparisons, including video results, are provided in the supplementary.
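
The temporal attention layer described in the Figure 1 caption can be illustrated with a small NumPy sketch. It shows only the factorization: spatial layers see each frame on its own, while a temporal layer attends across frames at each spatial location. The function names and the (T, H, W, C) layout are assumptions, and the learned query/key/value projections of a real layer are replaced by the identity to keep the example short, so this is not the paper's block.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_self_attention(video):
    """Attend across frames independently at every spatial location.

    video: (T, H, W, C) features. Query/key/value projections are omitted
    (identity), which is a simplification for illustration only.
    """
    t, h, w, c = video.shape
    # Each of the H*W locations becomes its own length-T sequence of C-dim tokens.
    seq = video.transpose(1, 2, 0, 3).reshape(h * w, t, c)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(c)  # (H*W, T, T)
    out = softmax(scores, axis=-1) @ seq                # (H*W, T, C)
    # Residual connection, then restore the (T, H, W, C) layout.
    return video + out.reshape(h, w, t, c).transpose(2, 0, 1, 3)
```

Because the spatial axes are folded into the batch dimension, no information moves between pixels in this layer. Mixing across pixels is left to the spatial convolution and attention layers that precede it in the block.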