Scalable Autoregressive Image Generation with Mamba

Haopeng Li, Jinyue Yang, Kexin Wang, Xuerui Qiu, Yuhong Chou, Xin Li, Guoqi Li

TL;DR

AiM introduces the first autoregressive image generator based on the Mamba state-space backbone, achieving high-quality class-conditional image synthesis with linear-time sequence modeling. By adding two simple adaptations (positional encoding and adaLN-group) and leveraging classifier-free guidance, AiM reaches a FID as low as 2.21 on ImageNet 256$\times$256, surpassing existing AR models of comparable parameter counts, with inference 2 to 10 times faster than diffusion models. The two-stage training pipeline (tokenizer/decoder followed by causal sequence modeling) preserves Mamba’s efficiency, and results improve consistently as parameter counts grow from 148M to 1.3B and as training is extended. This work highlights the practical viability of Mamba for visual generation and opens avenues for further exploration of text-to-image generation and more efficient autoregressive strategies in vision.

Abstract

We introduce AiM, an autoregressive (AR) image generative model based on the Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scan, AiM directly utilizes the next-token prediction paradigm for autoregressive image generation. This approach circumvents the need for extensive modifications to enable Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba's core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256$\times$256 benchmark, our best AiM model achieves a FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM
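
To make the generation loop concrete, here is a minimal Python sketch of next-token sampling with classifier-free guidance, as mentioned in the TL;DR. It is an illustration under stated assumptions, not the AiM implementation: `toy_next_logits` is a hypothetical stand-in for the trained AR model, the guidance rule is the commonly used linear form `uncond + scale * (cond - uncond)` (assumed here, not confirmed by this page), and the vocabulary size, token count, and guidance scale are arbitrary.

```python
import math
import random


def guided_logits(cond, uncond, scale):
    """Linear classifier-free guidance (assumed form): scale=1 gives cond, scale=0 gives uncond."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_next_logits(label, prefix, vocab_size=8):
    """Hypothetical stand-in for the AR model: logits for the next token
    given a class label and the tokens generated so far."""
    return [float((label + len(prefix) + v) % vocab_size) for v in range(vocab_size)]


def sample_tokens(next_logits, class_id, null_id, num_tokens, scale, rng):
    """Generate tokens one at a time; each step queries the model with the
    real class label and with a 'null' label, then mixes the two predictions."""
    tokens = []
    for _ in range(num_tokens):
        cond = next_logits(class_id, tokens)
        uncond = next_logits(null_id, tokens)
        probs = softmax(guided_logits(cond, uncond, scale))
        tokens.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_tokens(toy_next_logits, class_id=3, null_id=0,
                        num_tokens=16, scale=2.0, rng=rng))
```

In a real pipeline the sampled token sequence would be passed to the frozen decoder from stage 1 to produce the image; that step is omitted here.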

Paper Structure (25 sections, 10 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Autoregressive Image Generation with Mamba. We show samples from our class-conditional AiM-XL model trained on ImageNet at 256$\times$256 resolution.
  • Figure 2: AR image generation pipeline. Stage 1: Training the image tokenizer (encoder and quantizer) and decoder via image reconstruction. Stage 2: Training the AR model through causal sequence modeling. The symbol $\langle\text{C}\rangle$ represents the class embedding. Inference: Generating image tokens autoregressively by predicting the next token, which the decoder then converts into a synthesized image. The lock icon denotes frozen weights.
  • Figure 3: The cause of the mirror artifact in synthesized images. The boxed regions in the normal image and in the mirror-artifact image yield the same token sequence after flattening.
  • Figure 4: The impact of positional encoding. Without positional encoding, the model is prone to generating images with mirrored artifacts, as observed in the first row.
  • Figure 5: Architectural details of the AiM model. Our adaLN-group represents a more generalized form of both adaLN (when the number of groups equals the number of layers) and adaLN-single (when there is only one group).
  • ...and 3 more figures
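
The Figure 5 caption describes adaLN-group as covering both adaLN and adaLN-single, depending on the number of groups. The small Python sketch below illustrates only that sharing pattern. It assumes layers are split into contiguous, equal-sized groups, which this page does not state; `layer_to_group` and `num_modulation_sets` are hypothetical helper names, not part of the AiM code.

```python
def layer_to_group(num_layers: int, num_groups: int) -> list[int]:
    """Map each layer index to the group whose modulation parameters it shares.

    Assumption: groups are contiguous and equal-sized.
    """
    if num_groups < 1 or num_layers % num_groups != 0:
        raise ValueError("num_groups must be a positive divisor of num_layers")
    layers_per_group = num_layers // num_groups
    return [layer // layers_per_group for layer in range(num_layers)]


def num_modulation_sets(num_layers: int, num_groups: int) -> int:
    """Count the distinct sets of modulation parameters that must be stored."""
    return len(set(layer_to_group(num_layers, num_groups)))


if __name__ == "__main__":
    num_layers = 12  # illustrative depth, not taken from the paper
    for groups in (num_layers, 4, 1):
        print(groups, layer_to_group(num_layers, groups))
```

With `num_groups == num_layers` every layer gets its own parameters (the adaLN case), and with `num_groups == 1` all layers share one set (the adaLN-single case); intermediate values trade parameter count against per-layer flexibility.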