Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data

Tao Yang, Yangming Shi, Yunwen Huang, Feng Chen, Yin Zheng, Lei Zhang

TL;DR

This work tackles the problem of generating high-quality videos from limited and low-quality data without recaptioning or finetuning. It introduces Factorized-Dreamer, a two-stage framework that first generates an image from a descriptive caption and then synthesizes a video conditioned on that image plus a concise motion caption. Key innovations include a Text-to-Image Adapter, Pixel-Aware Cross Attention (PACA), a T5 encoder for motion understanding, and a PredictNet for motion supervision, all trained under a carefully designed noise schedule. The model, trained on publicly available WebVid-10M and PexelVideos, achieves competitive T2V performance and strong I2V results, outperforming several open-source baselines and approaching commercial systems, thereby lowering data requirements for HQ video generation. This approach broadens access to HQ video synthesis and sets a foundation for further improvements in long-video coherence and motion realism using public data and factorized generation strategies.

Abstract

Text-to-video (T2V) generation has gained significant attention due to its wide applications in video generation, editing, enhancement, translation, etc. However, high-quality (HQ) video synthesis is extremely challenging because of the diverse and complex motions that exist in the real world. Most existing works address this problem by collecting large-scale HQ videos, which are inaccessible to the community. In this work, we show that publicly available, limited, low-quality (LQ) data are sufficient to train an HQ video generator without recaptioning or finetuning. We factorize the whole T2V generation process into two steps: generating an image conditioned on a highly descriptive caption, and synthesizing the video conditioned on the generated image and a concise caption of motion details. Specifically, we present Factorized-Dreamer, a factorized spatiotemporal framework with several critical designs for T2V generation, including an adapter to combine text and image embeddings, a pixel-aware cross attention module to capture pixel-level image information, a T5 text encoder to better understand motion descriptions, and a PredictNet to supervise optical flows. We further present a noise schedule, which plays a key role in ensuring the quality and stability of video generation. Our model lowers the requirements for detailed captions and HQ videos, and can be trained directly on limited LQ datasets with noisy and brief captions such as WebVid-10M, largely alleviating the cost of collecting large-scale HQ video-text pairs. Extensive experiments on a variety of T2V and image-to-video generation tasks demonstrate the effectiveness of the proposed Factorized-Dreamer. Our source code is available at https://github.com/yangxy/Factorized-Dreamer/.
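The two-step factorization described in the abstract can be pictured as a simple composition of two generators. The following is only a minimal sketch of that control flow, not the authors' implementation: the class `FactorizedT2V`, the placeholder types `Image` and `Video`, and the toy functions `toy_text_to_image` and `toy_image_to_video` are all hypothetical names introduced here for illustration, and the toy functions do no real synthesis.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder types (assumptions for illustration, not the paper's data format).
Image = List[List[float]]  # a 2-D grid of pixel values
Video = List[Image]        # a list of frames


@dataclass
class FactorizedT2V:
    """Hypothetical wrapper: text -> image, then (image, motion text) -> video."""
    text_to_image: Callable[[str], Image]
    image_to_video: Callable[[Image, str, int], Video]

    def generate(self, descriptive_caption: str, motion_caption: str,
                 num_frames: int = 16) -> Video:
        # Step 1: generate an image from the highly descriptive caption.
        first_frame = self.text_to_image(descriptive_caption)
        # Step 2: synthesize the video from that image and a concise motion caption.
        return self.image_to_video(first_frame, motion_caption, num_frames)


def toy_text_to_image(caption: str) -> Image:
    """Toy stand-in for a text-to-image model: a constant 4x4 image."""
    value = float(len(caption) % 7)
    return [[value] * 4 for _ in range(4)]


def toy_image_to_video(image: Image, motion_caption: str, num_frames: int) -> Video:
    """Toy stand-in for an image-to-video model: frame t drifts from the input image."""
    step = 0.1 * len(motion_caption.split())
    return [[[px + step * t for px in row] for row in image]
            for t in range(num_frames)]


pipeline = FactorizedT2V(toy_text_to_image, toy_image_to_video)
video = pipeline.generate(
    "a red fox standing in deep snow at sunrise",  # descriptive caption (step 1)
    "the fox turns its head",                      # motion caption (step 2)
    num_frames=8,
)
print(len(video), "frames of size", len(video[0]), "x", len(video[0][0]))
```

In this sketch the second stage never sees the descriptive caption directly. It receives only the generated image and the motion caption, which mirrors the split of responsibilities stated in the abstract.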

Paper Structure (26 sections, 7 equations, 7 figures, 5 tables)

This paper contains 26 sections, 7 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of common video generation frameworks.
  • Figure 2: Architecture of the proposed Factorized-Dreamer, which consists of a T2I adapter, PACA modules, a T5 text encoder and a PredictNet. During training, the encoder maps the input video to a latent representation, to which noise is then added. The noisy latent is fed to the UNet along with the latent first frame, the T5 text embedding, and a combined text-image embedding produced by the T2I adapter. The latent first frame and the combined text-image embedding are injected into the UNet via PACA and CA in the upsampling layers, respectively. A PredictNet is introduced to enhance motion coherence by supervising optical flows.
  • Figure 3: Text-to-Video results by different methods.
  • Figure 4: Image-to-Video results by different methods.
  • Figure 5: Curves of log SNR versus timestep, with $s=0.125$.
  • ...and 2 more figures
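
The Figure 5 caption refers to log SNR as a function of timestep and to a parameter $s$, but this listing does not define either the base schedule or the role of $s$. As a rough illustration only, the sketch below assumes a cosine base schedule and assumes that $s$ shifts the log SNR by $2\log s$, a form of noise-schedule shifting used in some prior diffusion work. Both assumptions, and the function names `cosine_log_snr` and `shifted_log_snr`, are introduced here and may not match the paper's actual schedule.

```python
import math


def cosine_log_snr(t: float) -> float:
    """log SNR of a cosine schedule (assumed base schedule), for t in (0, 1).

    With alpha = cos(pi*t/2) and sigma = sin(pi*t/2),
    log(alpha^2 / sigma^2) = -2 * log(tan(pi*t/2)).
    """
    return -2.0 * math.log(math.tan(0.5 * math.pi * t))


def shifted_log_snr(t: float, s: float = 0.125) -> float:
    """Assumed shift: add 2*log(s) to the base log SNR (s < 1 means more noise)."""
    return cosine_log_snr(t) + 2.0 * math.log(s)


# Compare the base and shifted curves at a few normalized timesteps.
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"t={t:.1f}  base={cosine_log_snr(t):+.3f}  shifted={shifted_log_snr(t):+.3f}")
```

Under these assumptions, a value of $s$ below 1 lowers the whole curve by a constant, so every timestep is trained at a higher noise level than in the unshifted schedule.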