Frequency Autoregressive Image Generation with Continuous Tokens

Hu Yu, Hao Luo, Hangjie Yuan, Yu Rong, Feng Zhao

TL;DR

This work introduces Frequency Progressive Autoregressive (FAR), a paradigm for autoregressive image generation that regresses across increasing frequency levels using a continuous tokenizer. By leveraging spectral dependencies, FAR maintains image priors and spatial locality while improving efficiency through frequency-based autoregression, a diffusion-based loss for continuous tokens, and techniques such as masking and frequency-aware sampling. The approach achieves competitive quality with far fewer inference steps on ImageNet and shows potential for text-to-image generation with smaller models and data footprints. Ablations validate the key components (diffusion-loss simplification, the mask mechanism, and frequency-aware training), highlighting FAR's scalability and efficiency as a path toward more unified vision-language generation models.

Abstract

Autoregressive (AR) models for image generation typically adopt a two-stage paradigm of vector quantization and raster-scan "next-token prediction", inspired by the success of this approach in language modeling. However, due to the huge modality gap, image autoregressive models may require a systematic reevaluation from two perspectives: tokenizer format and regression direction. In this paper, we introduce the frequency progressive autoregressive (FAR) paradigm and instantiate FAR with a continuous tokenizer. Specifically, we identify spectral dependency as the desirable regression direction for FAR, wherein higher-frequency components build upon lower-frequency ones to progressively construct a complete image. This design fits the causality requirement of autoregressive models and preserves the unique spatial locality of image data. In addition, we study the integration of FAR with the continuous tokenizer, introducing a series of techniques to address optimization challenges and improve the efficiency of training and inference. We demonstrate the efficacy of FAR through comprehensive experiments on the ImageNet dataset and verify its potential on text-to-image generation.
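
To make the idea of a frequency-progressive target sequence concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not say how frequency levels are constructed, so the radial FFT low-pass filter, the function names `low_pass` and `frequency_levels`, and the linear cutoff schedule are all assumptions made for illustration. The sketch only shows that a sequence of increasingly sharp versions of one image forms a natural coarse-to-fine ordering, where each level adds higher frequencies on top of the previous one.

```python
import numpy as np


def low_pass(image, keep_fraction):
    """Keep only spatial frequencies inside a radial cutoff.

    Assumption: a hard radial mask in the 2D FFT domain. The paper may use
    a different filter; this is only an illustration.
    """
    h, w = image.shape
    # Shift the zero-frequency (DC) term to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.meshgrid(
        np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij"
    )
    radius = np.sqrt(yy**2 + xx**2)
    max_radius = np.sqrt((h // 2) ** 2 + (w // 2) ** 2)
    mask = radius <= keep_fraction * max_radius
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask))
    return filtered.real


def frequency_levels(image, num_levels):
    """Return num_levels versions of image, from lowest to full frequency.

    Assumption: a hypothetical linear schedule of cutoffs; the last level
    keeps every frequency and therefore equals the input.
    """
    return [low_pass(image, (k + 1) / num_levels) for k in range(num_levels)]


rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
levels = frequency_levels(image, num_levels=4)
errors = [float(np.linalg.norm(image - level)) for level in levels]
```

In this sketch, `errors` is non-increasing across levels and reaches approximately zero at the final level, which mirrors the description of higher-frequency components building on lower-frequency ones until the complete image is recovered.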

Paper Structure

This paper contains 23 sections, 5 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Regression direction paradigms in AR models for image generation. (a) Vanilla AR: sequential next-token generation in raster-scan order, from left to right, top to bottom; (b) Masked-AR: next-set prediction in random order, generating multiple tokens at each step; (c) VAR: combines RQ-VAE with multi-scale prediction, adding all scales together to obtain the final prediction and requiring a customized multi-scale discrete tokenizer; (d) FAR (ours): the proposed next-frequency prediction paradigm, which leverages the spectral dependency prior.
  • Figure 2: Sampling steps of FAR and diffusion loss.
  • Figure 3: Visual comparison with the representative MAR and VAR methods at 10 inference steps. Thanks to its intrinsic harmony with image data, FAR generates high-quality images with consistent structures and fine details in only 10 steps.
  • Figure 4: More visual results of the text-to-image autoregressive generation at 256x256 resolution.
  • Figure 5: Image reconstruction performance of continuous and discrete tokenizers under different spatial compression ratios (f=8 and f=16). Constrained by their finite codebooks, discrete tokenizers suffer significant information loss and struggle to faithfully reconstruct images with intricate, high-frequency details such as human faces. Note that the continuous tokenizer's reconstruction at f=16 is still better than the discrete tokenizer's at f=8, which is consistent with rate-distortion theory.
  • ...and 5 more figures