HARIVO: Harnessing Text-to-Image Models for Video Generation

Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh

TL;DR

This work proposes an architecture tailored for video generation, built around a mapping network and frame-wise tokens, that maintains the diversity and creativity of the original T2I model and integrates video-specific inductive biases into both the architecture and the loss functions.

Abstract

We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while training only the temporal layers. We advance this method by proposing a unique architecture, incorporating a mapping network and frame-wise tokens, tailored for video generation while maintaining the diversity and creativity of the original T2I model. Key innovations include novel loss functions for temporal smoothness and a mitigating gradient sampling technique, which together enable realistic and temporally consistent video generation despite limited public video data. We have successfully integrated video-specific inductive biases into the architecture and loss functions. Our method, built on the frozen Stable Diffusion model, simplifies the training process and allows for seamless integration with off-the-shelf models like ControlNet and DreamBooth. Project page: https://kwonminki.github.io/HARIVO

Paper Structure

This paper contains 34 sections, 7 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Our method generates high-quality videos for given text prompts and is easy to combine with other methods such as ControlNet.
  • Figure 2: An overview of our method: illustration and results. We propose a method where we freeze a Text-to-Image model and train only temporal layers, including the additional proposed networks. This technique enables the successful generation of videos in various styles.
  • Figure 3: (a) Our mapping network in front of the U-Net maps IID Gaussian noise to a proper distribution for video generation. $L_\text{reg}$ penalizes the difference between frames with and without temporal layers $l_\theta$ to preserve the expertise of the pretrained T2I model. (b) Temporal regularized self-attention loss penalizes the difference between self-attention maps of adjacent frames in the frozen T2I model to improve smoothness between consecutive frames.
  • Figure 4: (a) Decoupled contrastive loss on $\mathbf{h}$-space encourages semantic consistency within a video. $\mathbf{h}$-space is the bottleneck of the U-Net. Positive pairs are randomly chosen from a video and negative samples are stored in a queue. $g_\theta$ is a projection layer used only for training. (b) Frame-wise token generator $t_{\theta}$ produces frame-wise tokens that represent subtle differences across frames. These tokens are concatenated to the text tokens to be fed into the cross-attention layers.
  • Figure 5: Our method generates high-quality videos for given text prompts up to 512$^2$ resolution.
  • ...and 5 more figures
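
Two of the mechanisms named in the captions for Figures 3(b) and 4(b) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the array shapes, the use of a mean squared error as the penalty, and the number of tokens per frame are all assumptions made for illustration; the captions only state that differences between adjacent-frame self-attention maps are penalized and that frame-wise tokens are concatenated to the text tokens.

```python
import numpy as np


def temporal_attention_smoothness(attn_maps: np.ndarray) -> float:
    """Penalize differences between self-attention maps of adjacent frames.

    attn_maps has the assumed shape (frames, queries, keys).
    The squared-error form is an assumption; the source only says the
    difference between adjacent frames is penalized.
    """
    if attn_maps.ndim != 3 or attn_maps.shape[0] < 2:
        raise ValueError("expected shape (frames, queries, keys) with >= 2 frames")
    # Frame f+1 minus frame f, for every adjacent pair.
    diffs = attn_maps[1:] - attn_maps[:-1]
    return float(np.mean(diffs ** 2))


def append_frame_tokens(text_tokens: np.ndarray, frame_tokens: np.ndarray) -> np.ndarray:
    """Concatenate per-frame tokens to shared text tokens along the token axis.

    text_tokens:  (num_text_tokens, dim), shared by all frames.
    frame_tokens: (frames, tokens_per_frame, dim), hypothetical output of a
                  frame-wise token generator.
    Returns an array of shape (frames, num_text_tokens + tokens_per_frame, dim).
    """
    frames = frame_tokens.shape[0]
    # Repeat the text tokens for every frame without copying the data.
    tiled = np.broadcast_to(text_tokens, (frames,) + text_tokens.shape)
    return np.concatenate([tiled, frame_tokens], axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.random((8, 16, 16))   # 8 frames, 16x16 attention maps
    print("smoothness penalty:", temporal_attention_smoothness(attn))

    text = rng.random((77, 32))      # token count and width are placeholders
    per_frame = rng.random((8, 2, 32))
    print("context shape:", append_frame_tokens(text, per_frame).shape)
```

In this sketch the smoothness penalty is zero exactly when all frames share the same attention map, and each frame's cross-attention context differs from the others only in the appended frame-wise tokens.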