NFIG: Multi-Scale Autoregressive Image Generation via Frequency Ordering
Zhihao Huang, Xi Qiu, Yukuo Ma, Yifu Zhou, Junjie Chen, Hongyuan Zhang, Chi Zhang, Xuelong Li
TL;DR
NFIG is a frequency-aware autoregressive framework that generates images from low to high frequencies, aligning generation with the natural spectral structure of images. A Frequency-guided Residual-quantized VAE (FR-VAE) tokenizes images into multi-scale frequency components, which a frequency-aware Transformer then predicts in a coarse-to-fine sequence. On ImageNet-256, NFIG reports state-of-the-art AR performance (gFID 2.81, IS 332.42) and a 1.25x inference speedup over the VAR-d20 baseline, indicating that frequency priors benefit both quality and efficiency. Ablations and analysis examine the contributions of the FR-VAE tokenizer, the loss design, and CFG-based generation.
Abstract
Autoregressive models have achieved significant success in image generation. However, standard autoregressive methods typically generate pixels sequentially in a fixed spatial order, which does not reflect the inherent hierarchical structure of image information in the spectral domain. To better leverage this spectral hierarchy, we introduce Next-Frequency Image Generation (NFIG), a novel framework that decomposes the image generation process into multiple frequency-guided stages. NFIG aligns the generation process with the natural image structure by first generating low-frequency components, which efficiently capture global structure with significantly fewer tokens, and then progressively adding higher-frequency details. This frequency-aware paradigm offers two advantages: it improves the quality of generated images, and it reduces inference cost by establishing global structure early. Extensive experiments on the ImageNet-256 benchmark validate NFIG's effectiveness, demonstrating superior performance (FID: 2.81) and a notable 1.25x speedup compared to the strong baseline VAR-d20. The source code is available at https://github.com/Pride-Huang/NFIG.
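The low-to-high frequency ordering described above can be illustrated with a plain Fourier decomposition. The sketch below is not the FR-VAE tokenizer or any part of the NFIG implementation. It is a minimal NumPy example, with hypothetical function names (`frequency_bands`, `progressive_reconstructions`) and an assumed radial band split, that shows how an image can be separated into frequency bands and rebuilt coarse to fine by adding one band at a time.

```python
import numpy as np


def frequency_bands(image, num_bands=4):
    """Split a 2-D grayscale image into radial frequency bands, lowest first.

    Illustrative assumption: bands are equal-width rings in the Fourier
    domain. The rings partition the spectrum, so the bands sum back to the
    input up to floating-point error.
    """
    h, w = image.shape
    spectrum = np.fft.fft2(image)
    # Radial frequency (cycles per pixel) of every FFT coefficient.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy ** 2 + fx ** 2)
    # Ring edges from DC outward; the last ring is open-ended so that the
    # highest frequency present is always included.
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    edges[-1] = np.inf
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (radius >= lo) & (radius < hi)
        bands.append(np.fft.ifft2(spectrum * mask).real)
    return bands


def progressive_reconstructions(bands):
    """Return the running sums of the bands: coarse image first, full image last."""
    return list(np.cumsum(np.stack(bands), axis=0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32))
    recon = progressive_reconstructions(frequency_bands(img, num_bands=4))
    for i, r in enumerate(recon, start=1):
        print(f"bands 1..{i}: reconstruction error = {np.linalg.norm(img - r):.4f}")
```

Because the rings occupy disjoint parts of the spectrum, the reconstruction error cannot increase as bands are added, which mirrors the idea of fixing global structure first and refining detail afterwards. NFIG itself operates on learned, quantized multi-scale tokens rather than raw Fourier bands.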
