Lumina-mGPT 2.0: Stand-Alone AutoRegressive Image Modeling

Yi Xin, Juncheng Yan, Qi Qin, Zhen Li, Dongyang Liu, Shicheng Li, Victor Shea-Jay Huang, Yupeng Zhou, Renrui Zhang, Le Zhuo, Tiancheng Han, Xiaoqing Sun, Siqi Luo, Mengmeng Wang, Bin Fu, Yuewen Cao, Hongsheng Li, Guangtao Zhai, Xiaohong Liu, Yu Qiao, Peng Gao

TL;DR

Lumina-mGPT 2.0 is a stand-alone, decoder-only autoregressive image model trained from scratch to unify multiple generation tasks within a single framework. It reaches quality comparable to diffusion models while providing licensing freedom and architectural flexibility, aided by a unified tokenization scheme and a raster-scan generation strategy. The paper also proposes inference-time techniques to boost quality and speed: thinking-before-generation via an LLM, best-of-N sampling with verifiers, and acceleration through quantization and Speculative Jacobi Decoding. GenEval and DPG evaluations, together with ablations, show competitive quantitative performance and strong qualitative results, and the model handles multiple tasks such as subject-driven generation, image editing, controllable generation, and dense prediction. Limitations include substantial sampling times and reliance on an external LLM for prompt thinking; future work targets autonomous thinking and expanded multimodal understanding.
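
As a rough illustration of the best-of-N sampling with verifiers mentioned in the summary, the sketch below draws N candidates and keeps the one with the highest summed verifier score. It is not the authors' implementation: `best_of_n`, `toy_generate`, and `toy_brightness_verifier` are hypothetical names, the "image" is a plain list of floats, and summing verifier scores is an assumed aggregation rule.

```python
import random
from typing import Callable, List, Sequence, Tuple

Image = List[float]  # toy stand-in for a decoded image (assumption)
Generator = Callable[[str, random.Random], Image]
Verifier = Callable[[str, Image], float]


def best_of_n(
    prompt: str,
    generate: Generator,
    verifiers: Sequence[Verifier],
    n: int = 4,
    seed: int = 0,
) -> Tuple[Image, float, List[float]]:
    """Sample n candidates and return the best one, its score, and all scores."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    # Assumed aggregation: add up the scores of every verifier.
    scores = [sum(v(prompt, img) for v in verifiers) for img in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index], scores


def toy_generate(prompt: str, rng: random.Random) -> Image:
    """Hypothetical sampler: returns 8 random 'pixel' values."""
    return [rng.random() for _ in range(8)]


def toy_brightness_verifier(prompt: str, image: Image) -> float:
    """Hypothetical verifier: scores a candidate by its mean value."""
    return sum(image) / len(image)


if __name__ == "__main__":
    image, score, all_scores = best_of_n(
        "a red cube on a table", toy_generate, [toy_brightness_verifier], n=5
    )
    print(f"kept candidate with score {score:.3f} out of {len(all_scores)}")
```

In a real pipeline the generator would be the autoregressive sampler and the verifiers would be learned image-text scoring models; only the selection logic carries over from this sketch.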

Abstract

We present Lumina-mGPT 2.0, a stand-alone, decoder-only autoregressive model that revisits and revitalizes the autoregressive paradigm for high-quality image generation and beyond. Unlike existing approaches that rely on pretrained components or hybrid architectures, Lumina-mGPT 2.0 is trained entirely from scratch, enabling unrestricted architectural design and licensing freedom. It achieves generation quality on par with state-of-the-art diffusion models such as DALL-E 3 and SANA, while preserving the inherent flexibility and compositionality of autoregressive modeling. Our unified tokenization scheme allows the model to seamlessly handle a wide spectrum of tasks, including subject-driven generation, image editing, controllable synthesis, and dense prediction, within a single generative framework. To further boost usability, we incorporate efficient decoding strategies like inference-time scaling and speculative Jacobi sampling to improve quality and speed, respectively. Extensive evaluations on standard text-to-image benchmarks (e.g., GenEval, DPG) demonstrate that Lumina-mGPT 2.0 not only matches but in some cases surpasses diffusion-based models. Moreover, we confirm its multi-task capabilities on the Graph200K benchmark, with the native Lumina-mGPT 2.0 performing exceptionally well. These results position Lumina-mGPT 2.0 as a strong, flexible foundation model for unified multimodal generation. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-mGPT-2.0.

Paper Structure

This paper contains 18 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Multitask Capabilities of Lumina-mGPT 2.0. A stand-alone, decoder-only autoregressive model, trained from scratch, that unifies a broad spectrum of image generation tasks.
  • Figure 2: Decoder-only Transformer Architecture of Lumina-mGPT 2.0. This architecture utilizes autoregressive modeling for image synthesis and supports conditional image input, facilitating a wide range of generation tasks through the integration of both text prompts and optional reference images.
  • Figure 3: Image Reconstruction Results of Different Image Tokenizers. We present the specific details of selected areas, highlighted with red boxes, to demonstrate the performance of various image tokenizers.
  • Figure 4: Unifying Diverse Generation Tasks with Autoregressive Raster-Scan Scheme. The model generates the upper half of the image first (or takes the reference image supplied at inference time), which serves as contextual guidance for the generation of the lower half.
  • Figure 5: Pipeline for High-quality Sampling. We begin by thinking through the user's prompt, elaborating upon it to enhance clarity and coherence. Subsequently, we employ the best-of-N strategy to select the optimal image from the generated candidates.
  • ...and 6 more figures
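
The raster-scan scheme described in the Figure 4 caption can be illustrated with a toy sketch: because tokens are emitted row by row, the upper half of a vertically stacked image is a prefix of the sequence, so every lower-half token is generated with the upper half as context. This is only a sketch under stated assumptions, not the paper's code: tokens are plain integers, and `raster_scan_positions`, `complete_lower_half`, and `toy_next_token` are hypothetical helpers.

```python
from typing import Callable, List, Sequence, Tuple


def raster_scan_positions(height: int, width: int) -> List[Tuple[int, int]]:
    """Grid positions in raster order: left to right, then top to bottom."""
    return [(row, col) for row in range(height) for col in range(width)]


def complete_lower_half(
    upper_tokens: Sequence[int],
    height: int,
    width: int,
    next_token: Callable[[Sequence[int]], int],
) -> List[List[int]]:
    """Fill the lower half of a token grid, conditioning on the upper half.

    The upper half (generated earlier or supplied as a reference image) is
    the prefix; each new token sees everything emitted before it.
    """
    if height % 2 != 0:
        raise ValueError("height must be even to split into two halves")
    if len(upper_tokens) != (height // 2) * width:
        raise ValueError("upper_tokens must cover exactly the upper half")
    sequence = list(upper_tokens)
    while len(sequence) < height * width:
        sequence.append(next_token(sequence))
    # Reshape the flat raster-order sequence back into rows.
    return [sequence[r * width:(r + 1) * width] for r in range(height)]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for the model: previous token plus one, mod 10."""
    return (prefix[-1] + 1) % 10


if __name__ == "__main__":
    grid = complete_lower_half([1, 2, 3, 4], height=4, width=2,
                               next_token=toy_next_token)
    for row in grid:
        print(row)
```

Swapping the upper half between a generated image and a user-supplied reference changes nothing in the loop, which is the sense in which one scheme can cover several conditional tasks.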