Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao

TL;DR

Lumina-Video extends Next-DiT to video with a single Multi-scale Next-DiT backbone that jointly learns multiple patchifications (patch scales), plus explicit motion conditioning via a motion score. Progressive training at increasingly higher resolution and FPS, multi-source training on mixed natural and synthetic data, and a video-to-audio extension, Lumina-V2A, yield high-quality, temporally coherent video with flexible inference and synchronized audio. Ablations and benchmarks on VBench show competitive performance and a favorable efficiency–quality trade-off, with clear benefits from patch-scale diversity and motion conditioning. The work offers a practical, openly released framework for scalable video generation.

Abstract

Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in the generation of photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control of the dynamic degree of generated videos. Combined with a progressive training scheme with increasingly higher resolution and FPS, and a multi-source training scheme with mixed natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness at high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Code is released at https://www.github.com/Alpha-VLLM/Lumina-Video.
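To make the efficiency argument for multiple patchifications concrete, the sketch below counts how many tokens a transformer would process when the same video latent is split with different patch sizes. The latent shape and the three patch sizes are hypothetical values chosen for illustration, not figures taken from the paper.

```python
from math import prod

def num_tokens(latent_shape, patch_size):
    """Count tokens when a (T, H, W) latent is cut into non-overlapping patches."""
    if any(dim % p != 0 for dim, p in zip(latent_shape, patch_size)):
        raise ValueError("each latent dimension must be divisible by its patch size")
    return prod(dim // p for dim, p in zip(latent_shape, patch_size))

# Hypothetical latent grid: 8 frames, 32 x 32 spatial positions.
LATENT_SHAPE = (8, 32, 32)
# Hypothetical patch sizes as (time, height, width), ordered fine to coarse.
PATCH_SIZES = [(1, 2, 2), (2, 2, 2), (2, 4, 4)]

token_counts = {p: num_tokens(LATENT_SHAPE, p) for p in PATCH_SIZES}

for patch, count in token_counts.items():
    print(f"patch {patch}: {count} tokens")
```

Under these assumed shapes, the coarsest patchification yields 8 times fewer tokens than the finest one. Because self-attention cost grows faster than linearly in sequence length, coarser patches are considerably cheaper per forward pass, at the price of less spatial and temporal detail per token.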

Paper Structure

This paper contains 35 sections, 6 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Lumina-Video demonstrates a strong ability to generate high-quality videos with rich details and remarkable temporal coherence, accurately following both simple and detailed text prompts.
  • Figure 2: Architecture of Lumina-Video with Multi-scale Next-DiT and Motion Conditioning.
  • Figure 3: Loss curves for different patch sizes at different denoising timesteps. See the section labeled sec:full-loss-bin for the complete figure.
  • Figure 4: Multi-scale Patchification allows Lumina-Video to perform flexible multi-stage denoising during inference, leading to a better tradeoff between quality and efficiency.
  • Figure 5: Comparison of generated videos using different patchification strategies.
  • ...and 3 more figures
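The multi-stage denoising described in the Figure 4 caption can be pictured as a schedule that assigns one patch size to each sampling step. The sketch below is only an illustration under stated assumptions: the function names, the step counts, the patch sizes, the latent shape, and the choice to use coarse patches at the early (noisier) steps are all hypothetical and are not taken from the released code.

```python
from math import prod

def num_tokens(latent_shape, patch_size):
    """Count tokens for a (T, H, W) latent split into non-overlapping patches."""
    return prod(dim // p for dim, p in zip(latent_shape, patch_size))

def build_schedule(stages):
    """Expand [(num_steps, patch_size), ...] into one patch size per step.

    Stages are listed from the first (noisiest) steps to the last (cleanest).
    """
    schedule = []
    for steps, patch_size in stages:
        if steps < 0:
            raise ValueError("step counts must be non-negative")
        schedule.extend([patch_size] * steps)
    return schedule

# Hypothetical values: a latent of 8 frames at 32 x 32, and 10 sampling steps.
LATENT_SHAPE = (8, 32, 32)
FINE = (1, 2, 2)
stages = [(3, (2, 4, 4)), (3, (2, 2, 2)), (4, FINE)]  # assumed coarse-to-fine order

schedule = build_schedule(stages)

# Crude cost proxy: total tokens processed over all steps, relative to
# running every step with the finest patch size.
multi_stage_cost = sum(num_tokens(LATENT_SHAPE, p) for p in schedule)
all_fine_cost = len(schedule) * num_tokens(LATENT_SHAPE, FINE)
relative_cost = multi_stage_cost / all_fine_cost

print(schedule)
print(f"relative token cost: {relative_cost:.4f}")
```

Token count is only a rough proxy for compute, since attention scales faster than linearly with sequence length, so the real saving from a schedule like this would likely be larger than the ratio suggests. How many steps to give each stage is the knob that trades quality against speed.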