E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling

Zhihang Yuan, Yuzhang Shang, Hanling Zhang, Tongcheng Fang, Rui Xie, Bingxin Xu, Yan Yan, Shengen Yan, Guohao Dai, Yu Wang

TL;DR

ECAR tackles efficiency bottlenecks in continuous autoregressive image generation by coupling stage-wise token maps with multistage flow-based detokenization. The two core innovations enable parallel token sampling and partial denoising across resolutions, yielding substantial compute reductions without sacrificing quality. On 256×256 ImageNet-scale data, ECAR achieves competitive image fidelity with roughly a tenfold FLOP reduction and a fivefold speedup, demonstrating practical viability for fast, high-resolution generation. This work integrates continuous AR with pyramidal flow concepts to deliver scalable, high-quality visual synthesis.

Abstract

Recent advances in autoregressive (AR) models with continuous tokens for image generation show promising results by eliminating the need for discrete tokenization. However, these models face efficiency challenges because they generate tokens sequentially and rely on computationally intensive diffusion-based sampling. We present ECAR (Efficient Continuous Auto-Regressive Image Generation via Multistage Modeling), an approach that addresses these limitations through two intertwined innovations: (1) a stage-wise continuous token generation strategy that reduces computational complexity and provides progressively refined token maps as hierarchical conditions, and (2) a multistage flow-based distribution modeling method that transforms only partially denoised distributions at each stage, in contrast to the complete denoising performed by standard diffusion models. Holistically, ECAR generates tokens at increasing resolutions while simultaneously denoising the image at each stage. This design reduces the token-to-image transformation cost by a factor equal to the number of stages and enables parallel processing at the token level. The approach also aligns naturally with image generation principles by operating in continuous token space and following a hierarchical generation process from coarse to fine details. Experimental results demonstrate that ECAR achieves image quality comparable to DiT (Peebles & Xie, 2023) while requiring 10$\times$ fewer FLOPs and running 5$\times$ faster when generating a 256$\times$256 image.
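
The efficiency argument in the abstract can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the paper's cost model: the stage sizes, the step count, the even split of denoising steps across stages, and the assumption that one step costs an amount proportional to the number of positions in the token map are all hypothetical choices made for illustration.

```python
from typing import Sequence


def sequential_ar_passes(num_tokens: int) -> int:
    """Token-by-token AR needs one forward pass per generated token."""
    return num_tokens


def stagewise_ar_passes(stage_sides: Sequence[int]) -> int:
    """Stage-wise AR emits a whole token map per pass: one pass per stage."""
    return len(stage_sides)


def flow_cost(stage_sides: Sequence[int], total_steps: int) -> int:
    """Relative detokenization cost under two ASSUMPTIONS (not from the paper):
    denoising steps are split evenly across stages, and one step costs an
    amount proportional to the number of positions (side * side) at that stage.
    """
    steps_per_stage = total_steps // len(stage_sides)
    return sum(steps_per_stage * side * side for side in stage_sides)


# Hypothetical setup: three stages with 4x4, 8x8 and 16x16 token maps,
# and 30 denoising steps in total.
stages = [4, 8, 16]
steps = 30

baseline = flow_cost([16], steps)       # every step at full resolution
multistage = flow_cost(stages, steps)   # steps shared across resolutions

print(sequential_ar_passes(16 * 16), "AR passes token by token")
print(stagewise_ar_passes(stages), "AR passes stage by stage")
print(baseline, "vs", multistage, "relative detokenization cost")
```

Under these assumptions the final, full-resolution stage runs only one third of the steps, which is where a saving proportional to the number of stages comes from; the lower-resolution stages add a comparatively small overhead on top.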

Paper Structure

This paper contains 18 sections, 25 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: High-level idea of E-CAR. The model progressively generates tokens at increasing resolutions, while correspondingly denoising the image at each stage. By combining stage-by-stage continuous token generation with multistage flow-based image synthesis, E-CAR achieves efficient continuous autoregressive image generation while maintaining high visual quality.
  • Figure 2: (a) Diffusion/Flow-matching model: Generates images through multiple iterations of denoising/velocity network inference. (b) Traditional AR Transformer: Sequentially generates discrete tokens, followed by codebook-based detokenization. (c) Continuous Masked AR (Li et al., 2024): Sequentially produces continuous tokens, which are transformed into image patches via a diffusion model. (d) E-CAR: Introduces two key innovations, multi-stage continuous token generation and multi-stage flow, for efficient continuous token generation and token-to-image detokenization, respectively. Using the upsample-and-renoise technique (Jin et al., 2024), the number of flow-matching steps at each stage can be reduced correspondingly, making continuous token detokenization more efficient.
  • Figure 3: Training of E-CAR. The model combines multistage autoregressive token generation with progressive flow matching. The AR transformer (left) generates continuous token maps using a multistage causal attention mask; these maps are then transformed into spatial conditions for each stage. Each stage's token map conditions its corresponding flow model, enabling progressive reconstruction of image latents at different resolutions. The flow matching loss is computed between the predicted velocity and the ground-truth trajectory at each stage, with back-propagation through the entire pipeline for end-to-end training.
  • Figure 4: Samples from different models with the same noise.
  • Figure 5: Ablation study of AR.
  • ...and 1 more figure
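
The "multistage causal attention mask" named in the Figure 3 caption can be sketched as a block-wise causal mask. The sketch assumes that tokens of a stage may attend to every token of the same stage and of all earlier stages, but not to later stages; the caption does not spell out the exact masking rule, so this rule, the function name, and the stage lengths are illustrative assumptions.

```python
from typing import List, Sequence


def multistage_causal_mask(stage_lengths: Sequence[int]) -> List[List[bool]]:
    """Build a block-wise causal mask over a flattened multi-stage sequence.

    mask[i][j] is True when query position i may attend to key position j.
    ASSUMED rule: attention is allowed within a stage and towards earlier
    stages, never towards later stages.
    """
    # Stage index of every position in the flattened token sequence.
    stage_of = [k for k, n in enumerate(stage_lengths) for _ in range(n)]
    total = len(stage_of)
    return [
        [stage_of[j] <= stage_of[i] for j in range(total)]
        for i in range(total)
    ]


# Hypothetical example: a 1-token stage followed by a 4-token (2x2) stage.
mask = multistage_causal_mask([1, 4])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Because every position inside a stage shares the same visibility, all tokens of that stage can be produced in one forward pass, which is what permits parallel processing at the token level.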