Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT

Le Zhuo, Ruoyi Du, Han Xiao, Yangguang Li, Dongyang Liu, Rongjie Huang, Wenze Liu, Lirui Zhao, Fu-Yun Wang, Zhanyu Ma, Xu Luo, Zehan Wang, Kaipeng Zhang, Xiangyang Zhu, Si Liu, Xiangyu Yue, Dingning Liu, Wanli Ouyang, Ziwei Liu, Yu Qiao, Hongsheng Li, Peng Gao

TL;DR

Lumina-Next advances Lumina-T2X by redesigning the backbone as Next-DiT with 3D RoPE, sandwich normalization, and grouped-query attention, paired with frequency- and time-aware RoPE for robust resolution extrapolation. It introduces optimized sampling schedules and a Time-Aware Context Drop to dramatically boost inference speed, enabling few-step, high-quality generation. The framework demonstrates strong zero-shot multilingual text-to-image capabilities and extends to multi-view, audio, music, and point-cloud generation, validating universal applicability. Together, these contributions deliver a scalable, tuning-free, cross-domain generative system with practical implications for broad AI content creation. The authors also provide extensive experimental comparisons and release all code and weights to foster reproducibility and further research.

Abstract

Lumina-T2X is a nascent family of Flow-based Large Diffusion Transformers that establishes a unified framework for transforming noise into various modalities, such as images and videos, conditioned on text instructions. Despite its promising capabilities, Lumina-T2X still encounters challenges including training instability, slow inference, and extrapolation artifacts. In this paper, we present Lumina-Next, an improved version of Lumina-T2X, showcasing stronger generation performance with increased training and inference efficiency. We begin with a comprehensive analysis of the Flag-DiT architecture and identify several suboptimal components, which we address by introducing the Next-DiT architecture with 3D RoPE and sandwich normalization. To enable better resolution extrapolation, we thoroughly compare different context extrapolation methods applied to text-to-image generation with 3D RoPE, and propose Frequency- and Time-Aware Scaled RoPE tailored for diffusion transformers. Additionally, we introduce a sigmoid time discretization schedule to reduce the number of sampling steps in solving the Flow ODE, and the Context Drop method to merge redundant visual tokens for faster network evaluation, effectively boosting the overall sampling speed. Thanks to these improvements, Lumina-Next not only improves the quality and efficiency of basic text-to-image generation but also demonstrates superior resolution extrapolation capabilities and multilingual generation using decoder-based LLMs as the text encoder, all in a zero-shot manner. To further validate Lumina-Next as a versatile generative framework, we instantiate it on diverse tasks including visual recognition, multi-view, audio, music, and point cloud generation, showcasing strong performance across these domains. By releasing all code and model weights, we aim to advance the development of next-generation generative AI capable of universal modeling.
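The abstract's sigmoid time discretization can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function name and the `scale` parameter are illustrative. The idea is to warp a uniform grid through a sigmoid so that timesteps cluster near t=0 and t=1, where discretization error and local curvature of the Flow ODE are largest (cf. Figure 5), letting a small number of steps be spent where they matter most.

```python
import numpy as np

def sigmoid_schedule(num_steps: int, scale: float = 3.0) -> np.ndarray:
    """Sigmoid-spaced timesteps in [0, 1] for solving a flow ODE.

    A uniform grid on [-scale, scale] is warped through a sigmoid; the
    sigmoid's flat tails compress spacing near t=0 and t=1, so steps
    concentrate at the ends of the trajectory. `scale` controls how
    strongly the steps cluster there.
    """
    u = np.linspace(-scale, scale, num_steps + 1)  # uniform grid
    t = 1.0 / (1.0 + np.exp(-u))                   # sigmoid warp
    # renormalize so the schedule spans exactly [0, 1]
    t = (t - t[0]) / (t[-1] - t[0])
    return t
```

For example, `sigmoid_schedule(10)` yields 11 monotonically increasing timesteps whose spacing at the endpoints is several times finer than in the middle, in contrast to a uniform schedule.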

Paper Structure

This paper contains 46 sections, 5 equations, 23 figures, 8 tables, 1 algorithm.

Figures (23)

  • Figure 1: As a foundational generative framework, we demonstrate Lumina-Next's capabilities to generate high-resolution images, multi-view images, general audio and music, and 16K point clouds.
  • Figure 2: Architecture details of Flag-DiT and Next-DiT. The main improvements of Next-DiT include 3D RoPE, sandwich normalization, grouped-query attention, etc.
  • Figure 3: Visualization of attention score using (a) 1D RoPE and (b) 2D RoPE on images. We set the central point in the image as the anchor query.
  • Figure 4: Sandwich normalization effectively controls activation magnitudes over layers.
  • Figure 5: Discretization errors and local curvatures grow at the start and end of sampling.
  • ...and 18 more figures
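The 1D-vs-2D RoPE comparison in Figure 3 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code: a 2D rotary embedding rotates half of each head's channels by the token's row index and the other half by its column index, so attention between image tokens depends on true 2D offsets rather than raster-scan distance.

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray) -> np.ndarray:
    """Standard 1D rotary embedding over the last axis (d must be even)."""
    d = x.shape[-1]
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # (d/2,)
    angles = pos[..., None] * freqs                    # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin               # rotate each channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: np.ndarray, rows: np.ndarray, cols: np.ndarray) -> np.ndarray:
    """2D RoPE for image tokens: half the channels encode the row
    position, the other half the column position."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], axis=-1
    )
```

Because the rotation preserves the relative-position property of RoPE, the dot product between a rotated query and key depends only on their (row, column) offset, which is what produces the distance-aware attention pattern visualized in Figure 3(b).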