S2DM: Sector-Shaped Diffusion Models for Video Generation

Haoran Lang; Yuxuan Ge; Zheng Tian

TL;DR

This work proposes a novel Sector-Shaped Diffusion Model (S2DM) whose sector-shaped diffusion region is formed by a set of ray-shaped reverse diffusion processes starting at the same noise point, and proposes a two-stage generation strategy which can decouple the generation of temporal features from semantic-content features.

Abstract

Diffusion models have achieved great success in image generation. However, applying them to video generation raises significant challenges in maintaining consistency and continuity across video frames. This is mainly caused by the lack of an effective framework that aligns video frames with the desired temporal features while preserving consistent semantic and stochastic features. In this work, we propose a novel Sector-Shaped Diffusion Model (S2DM) whose sector-shaped diffusion region is formed by a set of ray-shaped reverse diffusion processes starting at the same noise point. Given appropriate guiding conditions, S2DM can generate a group of intrinsically related data that share the same semantic and stochastic features while varying in temporal features. We apply S2DM to video generation tasks and explore the use of optical flow as the temporal condition. Our experimental results show that S2DM outperforms many existing methods on video generation without using any temporal-feature modelling modules. For text-to-video generation tasks, where temporal conditions are not explicitly given, we propose a two-stage generation strategy that decouples the generation of temporal features from that of semantic-content features. We show that, without additional training, our model combined with a separate generative model for temporal conditions still achieves performance comparable to existing work. Our results can be viewed at https://s2dm.github.io/S2DM/.
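The idea of several diffusion trajectories meeting at one noise point can be illustrated with a small sketch. It assumes the standard DDPM forward form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps and a linear beta schedule; neither is confirmed by the text above, and the function names `linear_alpha_bar` and `shared_noise_perturb` are hypothetical. The only point it shows is that the noise sample is drawn once and reused for every frame of a video.

```python
import math
import random


def linear_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (an assumption)."""
    alpha_bar, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def shared_noise_perturb(frames, t, alpha_bar, rng):
    """Perturb every frame of one video with the SAME Gaussian noise sample.

    frames: list of K frames, each a flat list of floats of equal length.
    Returns (noisy_frames, noise).
    """
    dim = len(frames[0])
    # Drawn once and shared by all frames (hypothetical stand-in for shared-noise perturbation).
    noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    signal_scale = math.sqrt(alpha_bar[t])
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    noisy = [
        [signal_scale * x + noise_scale * e for x, e in zip(frame, noise)]
        for frame in frames
    ]
    return noisy, noise
```

Under these assumptions, the difference between two perturbed frames at step t is just the clean-frame difference scaled by sqrt(alpha_bar_t), so as alpha_bar_t approaches zero all frames of a video collapse toward the same noise point. This is one way to picture the rays of the sector sharing a common origin.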

Paper Structure (17 sections, 12 equations, 6 figures, 2 tables)

This paper contains 17 sections, 12 equations, 6 figures, 2 tables.

Introduction
Related Work
Sector-Shaped Diffusion Model with Shared Noise
Standard Diffusion Process
Sector-Shaped Diffusion Process
Sector-Shaped Forward Diffusion Process
Sector-Shaped Backward Diffusion Process
Sector-Shaped Diffusion Process for Conditional Video Generation
Optical flow-guided Video Generation
Two-stage generation strategy for Text-to-Video Generation
Experiments
Experimental Setup
Optical Flow-Guided Video Generation
Two-stage Text-to-Video Generation
Ablation Study
...and 2 more sections

Figures (6)

Figure 1: Sector-Shaped Diffusion Model (S2DM) generates a video sample through a sector-shaped inverse diffusion area guided by two conditions. We also propose a two-stage generation strategy based on S2DM for high-quality text-to-video generation. More results can be viewed at https://s2dm.github.io/S2DM/.
Figure 2: Comparison between video frames under different scenarios. Each row represents a sequence of frames from a video: the top two rows depict "A person is doing weightlifting" while the last row illustrates "A little baby is crawling".
Figure 3: Sector-Shaped Diffusion Model (S2DM): The sector-shaped inverse diffusion area is expanded by a set of ray-shaped inverse diffusion processes starting from the same initial noise point, guided by an identical semantic condition to maintain the content consistency and a set of temporal conditions to assign corresponding temporal features to each generated data point.
Figure 4: Training (left) and inference (right) stages of conditional video generation under the Sector-Shaped Diffusion Model (S2DM) framework (see the section on conditional video generation). During training, we perform shared-noise perturbation on the data from one video under the assumption of S2DM. We then concatenate the extracted, varying temporal conditions with the identical semantic condition to form the condition input of S2DM. At inference, given temporal and semantic conditions, S2DM iteratively generates the different video frames starting from the same initial random noise.
Figure 5: Inference process of the two-stage text-to-video generation pipeline (see the section on the two-stage generation strategy for text-to-video generation). First, we employ the second-stage model $\epsilon_{\theta}$ to generate a reference frame $I_{ref}$, given the text prompt $\tau$ and a zero optical flow $f^1$. Second, we generate an optical flow sequence $\hat{\mathcal{F}}$ with the first-stage model $\epsilon_{\phi}$, conditioned on the text prompt $\tau$ and the reference frame $I_{ref}$. Third, we generate a sequence of video frames with the second-stage model $\epsilon_{\theta}$, conditioned on the text prompt $\tau$ and the synthesized flow sequence $\hat{\mathcal{F}}$, using the same initial random noise sampled in the first step.
...and 1 more figures
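
The three inference steps described in the Figure 5 caption can be sketched as plain control flow. This is only an outline under stated assumptions: `two_stage_generate`, `sample_frame` and `sample_flows` are hypothetical names, the two callables stand in for full reverse-diffusion samplers of the second-stage and first-stage models, and frames and flows are represented as flat lists for brevity.

```python
import random


def two_stage_generate(sample_frame, sample_flows, prompt,
                       num_frames, frame_dim, flow_dim, seed=0):
    """Order of operations from the Figure 5 caption (illustrative only).

    sample_frame(noise, prompt, flow) -> frame
        hypothetical stand-in for sampling with the second-stage model
    sample_flows(prompt, ref_frame, num_frames) -> list of flows
        hypothetical stand-in for sampling with the first-stage model
    """
    rng = random.Random(seed)
    # Initial noise is sampled once and reused in step 3.
    noise = [rng.gauss(0.0, 1.0) for _ in range(frame_dim)]

    # Step 1: reference frame from the text prompt and a zero optical flow.
    zero_flow = [0.0] * flow_dim
    ref_frame = sample_frame(noise, prompt, zero_flow)

    # Step 2: optical-flow sequence conditioned on the prompt and the reference frame.
    flows = sample_flows(prompt, ref_frame, num_frames)

    # Step 3: one frame per synthesized flow, all starting from the same noise.
    frames = [sample_frame(noise, prompt, flow) for flow in flows]
    return ref_frame, flows, frames
```

Passing the samplers in as callables keeps the sketch independent of any particular network or noise schedule; what it fixes is only the ordering of the steps and the reuse of one noise sample across all frames.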
