Frequency Autoregressive Image Generation with Continuous Tokens
Hu Yu, Hao Luo, Hangjie Yuan, Yu Rong, Feng Zhao
TL;DR
This work introduces Frequency Progressive Autoregressive (FAR), a paradigm for autoregressive image generation that proceeds from low to high frequency levels and uses a continuous tokenizer. By following spectral dependencies, FAR preserves image priors and spatial locality. It improves efficiency by combining frequency-based autoregression with a diffusion-based loss for continuous tokens, a mask mechanism, and frequency-aware sampling. On ImageNet, the approach achieves competitive quality with far fewer inference steps, and it shows potential for text-to-image generation with smaller models and less data. Ablations validate the key components (diffusion-loss simplification, the mask mechanism, and frequency-aware training) and point to FAR's scalability and efficiency as a path toward more unified vision-language generation models.
Abstract
Autoregressive (AR) models for image generation typically adopt a two-stage paradigm of vector quantization followed by raster-scan "next-token prediction", inspired by the success of this approach in language modeling. However, because of the large modality gap between language and images, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction. In this paper, we introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with a continuous tokenizer. Specifically, we identify spectral dependency as the desirable regression direction for FAR, wherein higher-frequency components build upon lower-frequency ones to progressively construct a complete image. This design fits the causality requirement of autoregressive models and preserves the spatial locality unique to image data. We also study the integration of FAR with the continuous tokenizer, introducing a series of techniques that address optimization challenges and improve the efficiency of training and inference. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset and verify its potential for text-to-image generation.
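
As a rough illustration of the frequency-progressive idea described in the abstract, the Python sketch below builds a sequence of images that contain progressively more of the spectrum by low-pass filtering in the Fourier domain. The FFT-based square mask, the linear cutoff schedule, and the names `low_pass` and `frequency_levels` are assumptions made for illustration only. The abstract does not specify how FAR constructs its frequency levels, and this sketch operates on raw pixels rather than on continuous tokens.

```python
import numpy as np


def low_pass(image: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Keep only frequencies inside a centered square covering
    `keep_fraction` of the spectrum along each axis (hypothetical helper)."""
    h, w = image.shape
    # Move the zero-frequency (DC) component to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalized per-axis distance from the DC component, at most 1.
    fy = np.abs((yy - h // 2) / (h / 2))
    fx = np.abs((xx - w // 2) / (w / 2))
    mask = np.maximum(fy, fx) <= keep_fraction
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real


def frequency_levels(image: np.ndarray, num_levels: int) -> list:
    """Return `num_levels` images, from coarsest to the full image.
    The linear cutoff schedule is an assumption, not the paper's."""
    fractions = np.linspace(1.0 / num_levels, 1.0, num_levels)
    return [low_pass(image, f) for f in fractions]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.standard_normal((16, 16))
    levels = frequency_levels(image, num_levels=4)
    for k, level in enumerate(levels):
        # Energy grows as more frequency bands are included.
        print(k, float(np.sum(level ** 2)))
```

Each level contains every frequency of the previous one plus a new band, so the difference between consecutive levels is the higher-frequency detail that a later autoregressive step would have to supply.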
