xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong

TL;DR

This work addresses high-fidelity text-to-video (T2V) generation for long sequences by introducing xGen-VideoSyn-1, which pairs a video-specific VAE (VidVAE) that compresses video spatially and temporally with a Video Diffusion Transformer (VDiT) that synthesizes video conditioned on text. A divide-and-merge strategy enables end-to-end generation of long videos (over 100 frames at 720p) within memory limits, and a large-scale automated data pipeline yields over 13 million video-text pairs for training. The paper reports competitive quantitative performance against state-of-the-art T2V models, along with strong aesthetic and spatial quality, and describes a data-processing pipeline that uses a video LLM for dense captioning. Together, video-specific compression, transformer-based diffusion, and large-scale data collection make high-quality T2V synthesis practical across diverse styles and resolutions.
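To give a feel for why spatiotemporal compression matters at this scale, the sketch below counts positions in a 100-frame 720p clip (1280x720) before and after striding. The stride values (4x in time, 8x per spatial axis) and the function name `latent_positions` are illustrative assumptions, not figures stated in this summary.

```python
def latent_positions(frames, height, width, t_stride, s_stride):
    """Count grid positions after downsampling by t_stride in time and
    s_stride along each spatial axis (ceiling division at the borders)."""
    t = -(-frames // t_stride)
    h = -(-height // s_stride)
    w = -(-width // s_stride)
    return t * h * w


# A 100-frame 720p clip, matching the scale mentioned in the TL;DR.
frames, height, width = 100, 720, 1280

raw = latent_positions(frames, height, width, 1, 1)
# Hypothetical strides chosen for illustration only.
spatial_only = latent_positions(frames, height, width, 1, 8)
spatiotemporal = latent_positions(frames, height, width, 4, 8)

print(raw, spatial_only, spatiotemporal)
print(raw // spatial_only, raw // spatiotemporal)
```

Under these assumed strides, adding temporal compression on top of spatial compression cuts the sequence length by a further factor of four, which is the kind of saving the TL;DR attributes to a video-specific VAE.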

Abstract

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We built a data processing pipeline from scratch and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos longer than 14 seconds and demonstrates competitive performance against state-of-the-art T2V models.
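The abstract does not spell out how divide-and-merge works, so the following is only a minimal sketch of one plausible reading: split the frame sequence into overlapping segments, process each segment independently to bound memory, and average the results where segments overlap. The names `split_segments` and `divide_and_merge`, the segment length, the overlap size, and the averaging rule are all assumptions made for illustration. Each frame is represented by a single float to keep the example small.

```python
def split_segments(num_frames, seg_len, overlap):
    """Return (start, end) index pairs that cover [0, num_frames),
    with consecutive segments sharing `overlap` frames."""
    if not 0 <= overlap < seg_len:
        raise ValueError("overlap must satisfy 0 <= overlap < seg_len")
    step = seg_len - overlap
    spans = []
    start = 0
    while True:
        end = min(start + seg_len, num_frames)
        spans.append((start, end))
        if end == num_frames:
            break
        start += step
    return spans


def divide_and_merge(frames, process, seg_len=16, overlap=4):
    """Apply `process` to each segment and average overlapping frames.

    `process` stands in for a per-segment encoder or decoder and must
    return one value per input frame. Averaging is an assumed merge rule.
    """
    total = [0.0] * len(frames)
    count = [0] * len(frames)
    for start, end in split_segments(len(frames), seg_len, overlap):
        processed = process(frames[start:end])
        for offset, value in enumerate(processed):
            total[start + offset] += value
            count[start + offset] += 1
    return [t / c for t, c in zip(total, count)]


# Usage: with an identity `process`, merging reproduces the input.
frames = [float(i) for i in range(100)]
merged = divide_and_merge(frames, lambda segment: list(segment))
spans = split_segments(100, 16, 4)
```

Only one segment is in flight at a time, so peak memory depends on the segment length rather than the full clip length, while the shared frames give neighboring segments a common region to reconcile.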

Paper Structure (24 sections, 3 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: Example 720p text-to-video generation results by our xGen-VideoSyn-1 model.
  • Figure 2: The core of our xGen-VideoSyn-1 model is a Video DiT (VDiT) module and a new Video VAE (VidVAE) module. The Video VAE module encodes and compresses long video sequences into a latent representation during training, and reconstructs and decodes that latent representation into long, realistic video sequences during inference.
  • Figure 3: Detailed architecture of our proposed xGen-VideoSyn-1 model during training.
  • Figure 4: Video latent extraction pipeline.
  • Figure 5: Training data collection and processing pipeline.
  • ...and 11 more figures