SF-V: Single Forward Video Generation Model

Zhixing Zhang; Yanyu Li; Yushu Wu; Yanwu Xu; Anil Kag; Ivan Skorokhodov; Willi Menapace; Aliaksandr Siarohin; Junli Cao; Dimitris Metaxas; Sergey Tulyakov; Jian Ren

TL;DR

This work proposes a novel approach to obtaining single-step video generation models by using adversarial training to fine-tune pre-trained video diffusion models. It shows that a multi-step video diffusion model, namely Stable Video Diffusion (SVD), can be trained to synthesize high-quality videos in a single forward pass.

Abstract

Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through an iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs. In this work, we propose a novel approach to obtaining single-step video generation models by leveraging adversarial training to fine-tune pre-trained video diffusion models. We show that, through adversarial training, a multi-step video diffusion model, namely Stable Video Diffusion (SVD), can be trained to synthesize high-quality videos in a single forward pass, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality with significantly reduced computational overhead for the denoising process (around $23\times$ speedup compared with SVD and $6\times$ speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. More visualization results are publicly available at https://snap-research.github.io/SF-V.

Paper Structure (11 sections, 11 equations, 8 figures, 3 tables)

Introduction
Related Work
Method
Preliminaries of Stable Video Diffusion
Latent Adversarial Training for Video Diffusion Model
Spatial Temporal Heads
Experiment
Qualitative Visualization
Comparisons Results
Ablation Analysis
Discussion and Conclusion

Figures (8)

Figure 1: Example generation results from our single-step image-to-video model. Our model can generate high-quality, motion-consistent videos by performing the sampling only once during inference. Please refer to our project page, https://snap-research.github.io/SF-V, for the full video sequences.
Figure 2: Training Pipeline. We initialize our generator and discriminator using the weights of a pre-trained image-to-video diffusion model. The discriminator uses the encoder part of the UNet as its backbone, which remains frozen during training. We add a spatial discriminator head and a temporal discriminator head after each downsampling block of the discriminator backbone and update only the parameters of these heads during training. Given a video latent $x_0$, we first add noise at level $\sigma_t$ through a forward diffusion process to obtain $x_t$. The generator then predicts $\hat{x}_0$ given $x_t$. We calculate the reconstruction loss $\mathcal{L}_{recon}$ between $x_0$ and $\hat{x}_0$. Additionally, we add noise at level $\sigma_t^\prime$ to both $x_0$ and $\hat{x}_0$ to obtain the real and fake samples, $x_t^\prime$ and $\hat{x}_t^\prime$. The adversarial loss $\mathcal{L}_{adv}$ is then calculated using these real and fake sample pairs.
Figure 3: Spatial & Temporal Discriminator Heads. Our discriminator heads take in intermediate features of the UNet encoder. Following existing work sauer2021projected, sauer2023stylegan, we use the image conditioning and the frame index as the projected condition $\mathbf{c}$. Left: For spatial discriminator heads, the input features are reshaped to merge the temporal axis and the batch axis, such that each frame is treated as an independent sample. Right: For temporal discriminator heads, we merge the spatial dimensions into the batch axis.
Figure 4: Video Generation on Single Conditioning Images from Various Domains. We apply our method to various images generated by SDXL podell2023sdxl to synthesize videos. Each video contains $14$ frames at a resolution of $1024 \times 576$ and $7$ FPS. The results demonstrate that our model can generate high-quality, motion-consistent videos of various objects across different domains. Please refer to our project page, https://snap-research.github.io/SF-V, for the full video sequences.
Figure 5: Comparison between SVD blattmann2023stable, AnimateLCM animatelcm, LADD ladd, UFOGen ufogen, and Our Approach. We provide the synthesized videos (sampled frames) under various settings for different approaches. We use SVD to generate videos with $25$, $16$, and $8$ sampling steps, AnimateLCM to synthesize videos with $4$ sampling steps, and LADD and UFOGen to generate videos with $1$ sampling step. AnimateLCM, LADD, and UFOGen generate blurry frames with few-step and single-step sampling. Our approach accelerates sampling by $22.9\times$ compared with SVD while maintaining similar frame quality and motion consistency.
...and 3 more figures
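
The data flow described in the Figure 2 caption can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the `generator` and `discriminator` callables are hypothetical placeholders, the latent is a flat list of floats, mean squared error stands in for $\mathcal{L}_{recon}$, and a hinge loss stands in for $\mathcal{L}_{adv}$ (the caption does not specify either loss form). Drawing independent noise for the real and fake samples is also an assumption.

```python
import random
from typing import Callable, List, Tuple

Latent = List[float]  # a flattened video latent, used here for illustration only


def add_noise(x: Latent, sigma: float, rng: random.Random) -> Latent:
    """Forward diffusion step: x + sigma * eps, with eps drawn from N(0, 1)."""
    return [v + sigma * rng.gauss(0.0, 1.0) for v in x]


def training_step_data_flow(
    x0: Latent,
    generator: Callable[[Latent, float], Latent],     # hypothetical: (x_t, sigma_t) -> x0_hat
    discriminator: Callable[[Latent, float], float],  # hypothetical: (x_t', sigma_t') -> logit
    sigma_t: float,
    sigma_t_prime: float,
    rng: random.Random,
) -> Tuple[float, float, float]:
    """Return (l_recon, l_disc, l_gen_adv) for one latent, following Figure 2."""
    # 1. Noise the clean latent and let the generator predict x0 in one step.
    x_t = add_noise(x0, sigma_t, rng)
    x0_hat = generator(x_t, sigma_t)

    # 2. Reconstruction loss between x0 and x0_hat (MSE is an assumed stand-in).
    l_recon = sum((a - b) ** 2 for a, b in zip(x0, x0_hat)) / len(x0)

    # 3. Re-noise both latents at level sigma_t' to form real and fake samples.
    x_t_real = add_noise(x0, sigma_t_prime, rng)
    x_t_fake = add_noise(x0_hat, sigma_t_prime, rng)

    # 4. Adversarial terms (hinge form is an assumption, not from the caption).
    real_logit = discriminator(x_t_real, sigma_t_prime)
    fake_logit = discriminator(x_t_fake, sigma_t_prime)
    l_disc = max(0.0, 1.0 - real_logit) + max(0.0, 1.0 + fake_logit)
    l_gen_adv = -fake_logit
    return l_recon, l_disc, l_gen_adv
```

In a real setup the generator would be the fine-tuned UNet and only the discriminator heads would receive gradients from `l_disc`, since the caption states that the discriminator backbone stays frozen.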

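The axis merging described in the Figure 3 caption can be made concrete by tracking tensor shapes. The sketch below assumes a `(batch, frames, channels, height, width)` feature layout and a `(batch * height * width, channels, frames)` layout for the temporal heads; both layouts and the function names are illustrative assumptions, since the caption states only which axes are merged.

```python
from typing import Tuple

Shape5D = Tuple[int, int, int, int, int]  # assumed layout: (B, T, C, H, W)


def spatial_head_input_shape(feat_shape: Shape5D) -> Tuple[int, int, int, int]:
    """Merge the temporal axis into the batch axis: each frame is its own sample."""
    b, t, c, h, w = feat_shape
    return (b * t, c, h, w)


def temporal_head_input_shape(feat_shape: Shape5D) -> Tuple[int, int, int]:
    """Merge the spatial axes into the batch axis: each pixel location becomes
    a sample whose remaining sequence axis is time."""
    b, t, c, h, w = feat_shape
    return (b * h * w, c, t)
```

For example, a feature map of shape `(2, 14, 320, 72, 128)` (illustrative values only) would be seen by a spatial head as `28` independent frames, and by a temporal head as `2 * 72 * 128` independent length-`14` sequences.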