NFIG: Multi-Scale Autoregressive Image Generation via Frequency Ordering

Zhihao Huang, Xi Qiu, Yukuo Ma, Yifu Zhou, Junjie Chen, Hongyuan Zhang, Chi Zhang, Xuelong Li

TL;DR

NFIG introduces a frequency-aware autoregressive framework that models images from low to high frequencies to align generation with the natural spectral structure. A Frequency-guided Residual-quantized VAE (FR-VAE) tokenizes images into multi-scale frequency components, which are then predicted by a frequency-aware Transformer in a coarse-to-fine sequence. Empirical results on ImageNet-256 show state-of-the-art AR performance (gFID 2.81, IS 332.42) and a 1.25x inference speedup, highlighting the efficiency and quality benefits of leveraging frequency priors. The work also provides ablations and analysis demonstrating the contribution of the FR-VAE tokenizer, loss design, and CFG-based generation, underscoring the practical impact for scalable, high-quality autoregressive image synthesis.

Abstract

Autoregressive models have achieved significant success in image generation. However, unlike the inherent hierarchical structure of image information in the spectral domain, standard autoregressive methods typically generate pixels sequentially in a fixed spatial order. To better leverage this spectral hierarchy, we introduce Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. NFIG aligns the generation process with the natural structure of images by first generating low-frequency components, which efficiently capture global structure with significantly fewer tokens, and then progressively adding higher-frequency details. This frequency-aware paradigm offers substantial advantages: it not only improves the quality of generated images but also reduces inference cost by establishing global structure early on. Extensive experiments on the ImageNet-256 benchmark validate NFIG's effectiveness, demonstrating superior performance (FID: 2.81) and a notable 1.25x speedup compared to the strong baseline VAR-d20. The source code is available at https://github.com/Pride-Huang/NFIG.
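
The low-to-high ordering described in the abstract can be illustrated directly in pixel space. The following is a minimal sketch, not the FR-VAE tokenizer or the NFIG code: it assumes a 2-D grayscale image, uses NumPy's FFT, and splits the spectrum into equal-width radial bands. The function name `radial_frequency_bands` and the band layout are illustrative assumptions that do not come from the paper.

```python
import numpy as np


def radial_frequency_bands(image, num_bands):
    """Split a 2-D grayscale image into `num_bands` radial frequency bands.

    Illustrative helper (hypothetical, not from the NFIG code base).
    Band 0 holds the lowest frequencies, including the DC term.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every frequency bin from the centre (DC) of the shifted spectrum.
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)

    # Equal-width radial bands; an assumption made only for this sketch.
    edges = np.linspace(0.0, radius.max(), num_bands + 1)

    bands = []
    for i in range(num_bands):
        lo, hi = edges[i], edges[i + 1]
        if i == num_bands - 1:
            mask = (radius >= lo) & (radius <= hi)  # last band keeps the outer edge
        else:
            mask = (radius >= lo) & (radius < hi)
        # The mask is symmetric around DC, so the inverse transform is real
        # up to floating-point error.
        band = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
        bands.append(band)
    return bands


# Usage: add bands from low to high and watch the reconstruction error shrink.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
bands = radial_frequency_bands(image, num_bands=4)

partial = np.zeros_like(image)
for i, band in enumerate(bands, start=1):
    partial += band
    err = np.linalg.norm(image - partial)
    print(f"after band {i}: reconstruction error = {err:.4f}")
```

Because the bands partition the spectrum, summing all of them recovers the input image, and each added band can only lower the reconstruction error. NFIG itself operates on quantized latent tokens rather than raw pixels, so this only mirrors the ordering idea.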

Paper Structure

This paper contains 22 sections, 13 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustration of three autoregressive image generation frameworks. The figure demonstrates three prediction approaches: Next-Patch Prediction (patch-based progression), Next-Scale Prediction (coarse-to-fine resolution generation), and Next-Frequency Prediction (NFIG), which performs image generation by progressively predicting and synthesizing frequency components from low to high, resulting in a coarse-to-fine spatial reconstruction.
  • Figure 2: Overview of the Next-Frequency Image Generation (NFIG) framework: (a) The Frequency-guided Residual-Quantization VAE encodes images into, and decodes them from, frequency-guided residual-quantized representations; (b) The image is decomposed into frequency components (low to high) and reconstructed progressively by merging these components in a coarse-to-fine process; (c) The Next-Frequency Prediction model employs a frequency-aware Transformer to autoregressively generate token sequences, where blocks of the same color represent a specific frequency band, enabling sequential image synthesis from low to high frequencies. $N^{tok}_{i}=h_iw_i$ is the number of image tokens used for the $i$-th frequency band.
  • Figure 3: $256\times 256$ examples generated by NFIG trained on ImageNet.
  • Figure 4: Images generated by FR-VAE at steps 2, 4, 6, 8, and 10 of a 10-step process, with the corresponding frequency spectra. In the spectrum plots, brighter colors (red/yellow) indicate higher energy, while darker colors (blue) indicate lower energy. The center of each plot shows low-frequency information, with frequency increasing radially outward, revealing how the spectral distribution evolves during generation.
  • Figure 5: Vector quantization loss comparison between NFIG and VAR across image scales.
  • ...and 1 more figures
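
The token count $N^{tok}_{i}=h_iw_i$ from the Figure 2 caption can be made concrete with a small calculation. This is a sketch under stated assumptions: the grid sizes below are hypothetical and are not taken from the paper; they only show why low-frequency bands are cheap to generate relative to the full sequence.

```python
def tokens_per_band(grid_sizes):
    """Return N_i = h_i * w_i for each (h_i, w_i) token grid."""
    return [h * w for h, w in grid_sizes]


# Hypothetical square token grids, one per frequency band (low to high).
# These values are illustrative assumptions, not the schedule used by NFIG.
grid_sizes = [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]

counts = tokens_per_band(grid_sizes)
total = sum(counts)

for i, (n, (h, w)) in enumerate(zip(counts, grid_sizes), start=1):
    print(f"band {i}: {h}x{w} grid -> {n} tokens ({n / total:.1%} of the sequence)")
```

With these assumed grids, the first three bands together account for 21 of the 341 tokens, which is consistent with the abstract's point that global structure is captured with comparatively few tokens.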