Fast Autoregressive Models for Continuous Latent Generation
Tiankai Hang, Jianmin Bao, Fangyun Wei, Dong Chen
TL;DR
This work tackles the slow inference of autoregressive image generation in continuous latent spaces. It introduces FAR, which replaces MAR's diffusion head with a lightweight shortcut head, enabling few-step sampling while remaining autoregressive, and which plugs directly into causal Transformers so that they can generate continuous rather than discrete tokens. Empirically, FAR achieves up to 2.3× faster inference than MAR with competitive FID and IS scores on ImageNet-256, and its larger variants closely match or surpass diffusion-based baselines while requiring fewer training epochs. The approach combines efficiency and scalability in continuous-domain autoregressive modeling, offering practical benefits for real-time and large-scale visual generation without discrete tokenizers.
Abstract
Autoregressive models have demonstrated remarkable success in sequential data generation, particularly in NLP, but extending them to continuous-domain image generation presents significant challenges. A recent approach, the masked autoregressive model (MAR), bypasses quantization by modeling per-token distributions in continuous space with a diffusion head, but its iterative denoising process is computationally expensive and makes inference slow. To address this, we propose the Fast AutoRegressive model (FAR), a framework that replaces MAR's diffusion head with a lightweight shortcut head, enabling efficient few-step sampling while preserving autoregressive principles. Additionally, FAR integrates with causal Transformers, extending them from discrete to continuous token generation without requiring architectural modifications. Experiments demonstrate that FAR achieves $2.3\times$ faster inference than MAR while maintaining competitive FID and IS scores. This work establishes the first efficient autoregressive paradigm for high-fidelity continuous-space image generation, bridging the gap between quality and scalability in visual autoregressive modeling.
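
To make the few-step idea concrete, here is a minimal sketch of autoregressive sampling with a step-size-conditioned head. It is not the authors' implementation. It assumes the general shortcut-model update x ← x + d · s(x, t, d, c), where c is a per-token condition from the backbone, t runs from 0 (noise) to 1 (clean token), and d = 1/K is the step size for K sampling steps. All names (`toy_shortcut_head`, `toy_backbone`, `sample_token`, `generate`) are hypothetical, and both the head and the backbone are hand-written toys standing in for learned networks.

```python
import numpy as np


def toy_shortcut_head(x_t, t, d, cond):
    """Hypothetical stand-in for a learned shortcut head (assumption).

    A real head would be a small network conditioned on t, d and cond.
    This toy pretends the clean token is tanh(cond) and returns the
    direction that carries x_t to it in the remaining time (1 - t).
    It ignores d, which a learned head would use.
    """
    target = np.tanh(cond)
    return (target - x_t) / (1.0 - t)


def toy_backbone(prefix):
    """Hypothetical stand-in for a causal Transformer (assumption).

    Maps the tokens generated so far to a condition vector.
    """
    if prefix.shape[0] == 0:
        return np.zeros(prefix.shape[1])
    return prefix.mean(axis=0) + 0.1


def sample_token(head, cond, num_steps, rng):
    """Draw one continuous token with num_steps shortcut steps."""
    x = rng.standard_normal(cond.shape)  # start from Gaussian noise
    d = 1.0 / num_steps                  # step size for this budget
    for k in range(num_steps):
        t = k / num_steps                # current time in [0, 1)
        x = x + d * head(x, t, d, cond)  # jump from t to t + d
    return x


def generate(head, backbone, num_tokens, dim, num_steps, seed=0):
    """Generate tokens one at a time, each conditioned on the prefix."""
    rng = np.random.default_rng(seed)
    tokens = np.zeros((0, dim))
    for _ in range(num_tokens):
        cond = backbone(tokens)
        new_token = sample_token(head, cond, num_steps, rng)
        tokens = np.vstack([tokens, new_token])
    return tokens


if __name__ == "__main__":
    seq = generate(toy_shortcut_head, toy_backbone,
                   num_tokens=4, dim=3, num_steps=2)
    print(seq.shape)
```

In this sketch the per-token cost scales with `num_steps` head evaluations, which is why replacing a many-step denoising loop with a few large steps can reduce inference time. The toy head reaches its target for any step count only because the target is a single point; a learned head would trade off step count against sample quality.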
