DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction

Yiheng Liu, Liao Qu, Huichao Zhang, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Xian Li, Shuai Wang, Daniel K. Du, Fangmin Chen, Zehuan Yuan, Xinglong Wu

TL;DR

DetailFlow introduces Next-Detail Prediction, a coarse-to-fine 1D autoregressive image generation approach built on a resolution-aware token sequence, in which a mapping $\mathcal{R}(n)$ ties the number of tokens $n$ to the image resolution. A 1D tokenizer compresses 2D image information into semantically ordered latent tokens, enabling global-to-local synthesis with significantly fewer tokens than prior AR methods. The framework supports parallel inference with a self-correction mechanism, which accelerates generation by about $8\times$ and mitigates error accumulation, while aligning the first token with global SigLIP2 features further improves fidelity. On ImageNet $256\times256$, DetailFlow-16 with 128 tokens achieves $2.96$ gFID, outperforming state-of-the-art AR models that use far more tokens, and it shows strong efficiency gains, positioning it as a scalable solution for high-resolution autoregressive image synthesis.

Abstract

This paper presents DetailFlow, a coarse-to-fine 1D autoregressive (AR) image generation method that models images through a novel next-detail prediction strategy. By learning a resolution-aware token sequence supervised with progressively degraded images, DetailFlow enables the generation process to start from the global structure and incrementally refine details. This coarse-to-fine 1D token sequence aligns well with the autoregressive inference mechanism, providing a more natural and efficient way for the AR model to generate complex visual content. Our compact 1D AR model achieves high-quality image synthesis with significantly fewer tokens than previous approaches, e.g., VAR and VQGAN. We further propose a parallel inference mechanism with self-correction that accelerates generation by approximately 8x while reducing the accumulated sampling error inherent in teacher-forcing supervision. On the ImageNet 256x256 benchmark, our method achieves 2.96 gFID with 128 tokens, outperforming VAR (3.3 FID) and FlexVAR (3.05 FID), which both require 680 tokens in their AR models. Moreover, owing to the significantly reduced token count and the parallel inference mechanism, our method runs inference nearly 2x faster than VAR and FlexVAR. Extensive experimental results demonstrate DetailFlow's superior generation quality and efficiency compared to existing state-of-the-art methods.
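
To make the idea of a resolution-aware token sequence concrete, the sketch below maps a token count to a target resolution, so that decoding only the first `n` tokens would be supervised against the image downsampled to that resolution. The actual mapping used by DetailFlow is not given in this summary. The power-law form, the function name `resolution_for_tokens`, and the parameters `min_res`, `max_res`, and `alpha` are all assumptions chosen purely for illustration.

```python
def resolution_for_tokens(n, total_tokens=128, min_res=16, max_res=256, alpha=1.0):
    """Hypothetical monotone mapping from a token count n to a target resolution.

    NOT the paper's formula: a power-law interpolation between min_res and
    max_res, assumed here only to illustrate the coarse-to-fine idea.
    """
    if not 1 <= n <= total_tokens:
        raise ValueError("n must lie in [1, total_tokens]")
    if total_tokens == 1:
        return max_res
    # frac is 0 for the first token and 1 for the full sequence.
    frac = (n - 1) / (total_tokens - 1)
    return round(min_res + (max_res - min_res) * frac ** alpha)


# Target resolutions for a few prefix lengths under the assumed mapping.
schedule = [resolution_for_tokens(n) for n in (1, 32, 64, 128)]
```

Under this assumed form, `alpha` controls how quickly resolution grows with the token count: values above 1 hold the resolution low for longer, while values below 1 raise it early.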

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: (a) Progressive generation results from DetailFlow. Our proposed 1D tokenizer encodes tokens with an inherent semantic ordering, where each subsequent token contributes additional high-resolution information. The sequences illustrate how image resolution and the number of inferred 1D tokens increase incrementally from left to right. (b) Comparison of our DetailFlow approach with existing methods, showing that DetailFlow achieves better image quality with fewer tokens and less inference time.
  • Figure 2: Comparison of different prediction strategies in image generation. (a) Traditional 2D raster-scan next-token/next-patch prediction. (b) Next-scale prediction in VAR (Tian et al., 2024). (c) Our proposed next-detail prediction, which predicts 1D tokens encoding fine-grained details for high-resolution image generation.
  • Figure 3: (a) Coarse-to-fine tokenizer training. The encoder maps high-resolution images to 1D latent token sequences. Decoding with more tokens yields higher-resolution outputs, with earlier tokens capturing global structure and later ones refining details. (b) Self-correction training. Randomly perturbed tokens are re-encoded, encouraging subsequent tokens to correct errors introduced by earlier noisy tokens. (c) Autoregressive (AR) model training and decoding. The AR model predicts the first group of tokens in a next-token prediction manner, followed by parallel prediction of subsequent groups. At inference, more predicted tokens lead to higher-resolution outputs.
  • Figure 4: (a) Reconstruction metrics before and after self-correction when adding noise to latent tokens of a group (tokenizer with 128 tokens, group size 8, trained for 200 epochs). (b) Impact of token count on image resolution, reconstruction quality (rFID), and generation quality (gFID), with all evaluations conducted on images resized to $256 \times 256$. The tokenizer is identical to (a). (c) Influence of the hyperparameter $\alpha$ in the mapping function $\mathcal{R}(n)$ on generation metrics, using tokenizers trained for 50 epochs.
  • Figure 5: Qualitative comparison of AR model outputs with (w/) and without (w/o) self-correction training.
  • ...and 3 more figures
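
The decoding schedule described in the Figure 3(c) caption can be illustrated with a small step count: the first group is generated token by token, and each later group is assumed here to take one parallel forward pass. The function name `sequential_steps` is hypothetical, and the example configuration (128 tokens, group size 8) is borrowed from the tokenizer setting in the Figure 4(a) caption rather than from a stated inference setup.

```python
def sequential_steps(total_tokens, group_size):
    """Count sequential forward passes under an assumed decoding schedule.

    Assumptions (for illustration only): the first group is predicted one
    token at a time, and every later group is predicted in a single pass.
    """
    if group_size <= 0 or total_tokens % group_size != 0:
        raise ValueError("total_tokens must be a positive multiple of group_size")
    num_groups = total_tokens // group_size
    first_group_steps = group_size        # next-token prediction within group 1
    later_group_steps = num_groups - 1    # one parallel pass per remaining group
    return first_group_steps + later_group_steps


# 128 tokens in groups of 8: 8 passes for the first group plus 15 for the rest.
steps = sequential_steps(128, 8)          # 23, versus 128 fully sequential passes
```

This only counts forward passes under the stated assumptions. It is not a measurement of wall-clock speedup, which also depends on the model and hardware.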