Fast Autoregressive Models for Continuous Latent Generation

Tiankai Hang, Jianmin Bao, Fangyun Wei, Dong Chen

TL;DR

This work tackles the inference bottleneck of autoregressive generation in continuous latent spaces for high-fidelity image synthesis. It introduces FAR, a framework that replaces MAR's diffusion head with a lightweight shortcut head, enabling few-step sampling while keeping generation autoregressive, and that lets causal Transformers operate directly in continuous latent space. Empirically, FAR achieves up to 2.3× faster inference than MAR with competitive FID/IS scores on ImageNet-256, and its larger variants closely match or surpass diffusion-based baselines while requiring fewer training epochs. The approach combines efficiency and scalability in continuous-domain autoregressive modeling, which benefits real-time and large-scale visual generation without the need for discrete tokenizers.

Abstract

Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but their extension to continuous-domain image generation presents significant challenges. Recent work, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous spaces using a diffusion head but suffers from slow inference due to the high computational cost of the iterative denoising process. To address this, we propose the Fast AutoRegressive model (FAR), a novel framework that replaces MAR's diffusion head with a lightweight shortcut head, enabling efficient few-step sampling while preserving autoregressive principles. Additionally, FAR seamlessly integrates with causal Transformers, extending them from discrete to continuous token generation without requiring architectural modifications. Experiments demonstrate that FAR achieves $2.3\times$ faster inference than MAR while maintaining competitive FID and IS scores. This work establishes the first efficient autoregressive paradigm for high-fidelity continuous-space image generation, bridging the critical gap between quality and scalability in visual autoregressive modeling.

Paper Structure

This paper contains 15 sections, 8 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Inference cost breakdown and efficiency comparison among FAR, MAR [li2025mar], and DiT [peebles2023dit] for generating a $256 \times 256$ resolution image. Both FAR and MAR utilize the same encoder-decoder architecture, comprising 24 Transformer blocks with 172M parameters, along with a 6-layer MLP as the head network. In comparison, DiT (DiT-XL version), which achieves similar performance, features 28 Transformer blocks and 676M parameters. In MAR, the head network is the primary computational bottleneck, accounting for the majority of the inference cost. FAR mitigates this issue by introducing a more efficient head network that requires fewer denoising steps, achieving up to $2.3\times$ acceleration over MAR while maintaining nearly identical performance on ImageNet [deng2009imagenet] generation.
  • Figure 2: (a) FAR introduces a shortcut head that could replace the high-cost diffusion-based head in MAR, significantly reducing the inference cost while preserving the autoregressive principle and maintaining performance. (b-c) Integration of the FAR head enables a causal Transformer to transition from operating in a discrete space to a continuous space for image generation.
  • Figure 3: Architecture of the FAR head, a shortcut-based network [frans2024one]. The network processes a noisy token as input and produces a denoised output, guided jointly by a condition from the backbone, a desired step size, and a denoising timestep.
  • Figure 4: Analysis on the proportion of inference cost attributed to the head network in FAR (left) and FAR-Causal (right) under various settings. Left: Evaluation of three FAR variants, each using a different number of iterations ($K=32,64,256$) to generate an image. For each variant, we analyze the head network cost ratio by varying the number of denoising steps ($O=2,8,25,50,100$) required by the head network per image token generation. Right: For each variant, we examine how the proportion of inference cost attributed to the head network changes as the number of denoising steps per image token generation varies.
  • Figure 5: The impact of different CFG weights on FID for FAR-B under various configurations of autoregressive iterations ($K=32, 256$) and denoising steps ($O=1, 8$) per token generation.
  • ...and 1 more figure
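
The Figure 3 caption describes a head that takes a noisy token, a backbone condition, a step size, and a timestep. As a rough illustration only, the sketch below shows how such a head could be called in a few-step sampling loop. It assumes the head returns a direction that is scaled by the step size, as in shortcut-style models; the caption itself only says the head produces a denoised output, so this update rule is an assumption. The names `toy_head` and `sample_token` are hypothetical, and `toy_head` is a hand-written stand-in rather than a learned network.

```python
import random


def toy_head(x, cond, t, d):
    """Hypothetical stand-in for a learned shortcut head.

    Returns the straight-line direction from the current token x toward
    cond. A real head would be a network that also uses the step size d;
    this toy ignores d and is included only to make the loop runnable.
    """
    remaining = max(1.0 - t, 1e-8)  # time left until t = 1
    return [(c - xi) / remaining for xi, c in zip(x, cond)]


def sample_token(head, cond, num_steps, seed=0):
    """Generate one continuous token in num_steps equal-sized steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in cond]  # noisy token at t = 0
    d = 1.0 / num_steps                      # desired step size
    for i in range(num_steps):
        t = i * d                            # denoising timestep in [0, 1)
        direction = head(x, cond, t, d)      # head sees token, condition, t, d
        x = [xi + d * vi for xi, vi in zip(x, direction)]
    return x


if __name__ == "__main__":
    cond = [0.5, -1.0, 2.0]  # assumed condition vector from the backbone
    for steps in (1, 2, 8):
        print(steps, sample_token(toy_head, cond, steps))
```

The number of loop iterations corresponds to the per-token denoising steps $O$ in the captions above, so reducing it reduces the number of head evaluations per token proportionally.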