IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

Ran Yi, Teng Hu, Zihan Su, Lizhuang Ma

TL;DR

IAR2 addresses the bottleneck of single-codebook autoregressive image generation by introducing a Semantic-Detail Associated Dual Codebook that decouples global semantics from fine details. It couples a Hierarchical Semantic-Detail Autoregressive Prediction with a Local-Context Enhanced AR Head and a Progressive Attention-Guided CFG mechanism to achieve strong conditional alignment without sacrificing realism. The approach yields state-of-the-art results on ImageNet (FID $=1.50$) with substantially better efficiency, evidencing robust coarse-to-fine generation and scalable performance. This structured, hierarchical token modeling offers a practical path toward high-fidelity, conditioned autoregressive visual generation with improved interpretability and efficiency.

Abstract

Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction (first the semantic token, then the detail token) while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism. Extensive experiments demonstrate that IAR2 sets a new state of the art for autoregressive image generation, achieving an FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.
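
The per-token guidance idea in the abstract can be illustrated with a minimal sketch. It uses the standard classifier-free guidance combination on logits and lets the scale vary per token. The abstract does not state the actual formula, so `adaptive_scale`, its linear progress schedule, and the `relevance` input (a stand-in for an attention-derived score in [0, 1]) are hypothetical choices made for illustration, not the authors' method.

```python
def adaptive_scale(base_scale, relevance, step, total_steps):
    """Hypothetical per-token guidance scale (an assumption, not the paper's formula).

    relevance: score in [0, 1], standing in for how strongly the token
               relates to the condition (e.g. derived from attention).
    step:      0-based position of the token in the generation sequence.
    """
    progress = (step + 1) / total_steps  # assumed linear schedule over positions
    # Interpolate between no extra guidance (1.0) and the full base scale.
    return 1.0 + (base_scale - 1.0) * relevance * progress


def guided_logits(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Toy logits over a 2-entry vocabulary for one token.
cond_logits = [2.0, 0.0]
uncond_logits = [1.0, 1.0]

# A token with moderate relevance, early in a 4-token sequence.
scale = adaptive_scale(base_scale=3.0, relevance=0.5, step=1, total_steps=4)
logits = guided_logits(cond_logits, uncond_logits, scale)
print(scale, logits)  # 1.5 [2.5, -0.5]
```

With `relevance` equal to 0 the scale falls back to 1.0 and the guided logits equal the conditional logits, so tokens judged irrelevant to the condition receive no extra guidance in this sketch.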

Paper Structure

This paper contains 37 sections, 17 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Performance comparison with state-of-the-art methods on ImageNet. At the same parameter count, our model consistently achieves the best FID. Moreover, IAR2 also achieves the best overall FID (1.50) across different models and model sizes.
  • Figure 2: The IAR Framework: IAR begins by rearranging its codebook to group semantically similar image embeddings into distinct clusters. Subsequently, during the training of the autoregressive model, IAR introduces a cluster-level constraint. This constraint guides the model to predict the correct cluster index for a given image, ensuring that the generated embedding is close to the target. This approach significantly enhances the robustness and overall performance of the AR model.
  • Figure 3: (a) The MSE and LPIPS between the source image and the reconstructed image under different code distances. (b) Visualization of decoded images at varying code distances.
  • Figure 4: Impact of codebook size on reconstruction and generation. While increasing the codebook size enhances reconstruction accuracy, an excessively large codebook complicates the learning task for the generative model, leading to degraded generation quality. In contrast, our semantic-detail associated quantization strategy strikes an effective balance, achieving high fidelity in both reconstruction and generation.
  • Figure 5: IAR2 consists of three main modules: 1) The Semantic-Detail Associated Quantization Module disentangles an input image into two distinct sets of discrete codes: semantic codes for high-level content and detail codes for fine-grained visual information; 2) The Semantic-Detail Autoregression Model processes these token pairs by fusing them into a unified hidden state, which is then fed into an autoregressive backbone to obtain global contexts; 3) The Local-Context Enhanced Autoregression Head performs hierarchical prediction of semantic and detail tokens, and leverages neighboring local context tokens to enrich the local information, thereby enhancing generation accuracy for both semantic and detail codes.
  • ...and 6 more figures
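
The codebook trade-off in the Figure 4 caption and the two-code quantization in the Figure 5 caption can be illustrated with a toy sketch. It shows how a pair of small codebooks reaches many more distinct reconstructions than the number of entries it stores. The residual-style design below (the detail code quantizes what the semantic code leaves over) and all names and values are assumptions for illustration. The captions do not specify how the two codebooks are associated.

```python
from itertools import product


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def dual_quantize(vec, semantic_cb, detail_cb):
    """Two-stage quantization: a coarse semantic code first, then a detail
    code for the residual (assumed design, not taken from the paper)."""
    s = nearest(semantic_cb, vec)
    residual = [v - c for v, c in zip(vec, semantic_cb[s])]
    d = nearest(detail_cb, residual)
    recon = [a + b for a, b in zip(semantic_cb[s], detail_cb[d])]
    return s, d, recon


# Toy 2-D codebooks: 3 coarse "semantic" entries, 4 small "detail" offsets.
semantic_cb = [[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]]
detail_cb = [[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [-0.5, -0.5]]

s, d, recon = dual_quantize([4.4, 4.1], semantic_cb, detail_cb)
print(s, d, recon)  # 1 1 [4.5, 4.0]

# Stored entries grow as K_s + K_d; reachable reconstructions grow as K_s * K_d.
stored_entries = len(semantic_cb) + len(detail_cb)
reachable = {
    tuple(a + b for a, b in zip(sc, dc))
    for sc, dc in product(semantic_cb, detail_cb)
}
print(stored_entries, len(reachable))  # 7 12
```

In this sketch a predictor only has to choose among 3 semantic and 4 detail classes, yet the pair indexes 12 distinct reconstructions. This is one way to read the abstract's claim that capacity grows from a linear to a polynomial scale.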