Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction

Teng Hu, Jiangning Zhang, Ran Yi, Jieyu Weng, Yabiao Wang, Xianfang Zeng, Zhucun Xue, Lizhuang Ma

TL;DR

The paper addresses how LLM-based visual generation often treats images as token indices, overlooking fundamental differences between language and vision. It introduces IAR, comprising Codebook Rearrangement (via balanced K-means) and a Cluster-oriented Cross-Entropy Loss, to exploit embedding similarities and relax token prediction to cluster prediction, improving robustness and training efficiency. The authors analyze image embedding similarity, formalize cluster-based objectives, and demonstrate across 100M–1.4B parameter scales that IAR reduces training time while maintaining or enhancing generation quality, achieving strong FID/IS on ImageNet and showing applicability to various LLM-based visual models. This work provides a practical, scalable path to more efficient and robust autoregressive visual generation and suggests future exploration of continuous embedding constraints within discrete-token frameworks, with broad implications for multi-modal LLMs.

Abstract

Employing LLMs for visual generation has recently become a research focus. However, existing methods primarily transfer the LLM architecture to visual generation and rarely investigate the fundamental differences between language and vision. This oversight may lead to suboptimal use of visual generation capabilities within the LLM framework. In this paper, we explore the characteristics of the visual embedding space under the LLM framework and find that the correlation between visual embeddings can help achieve more stable and robust generation results. We present IAR, an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models. First, we propose a Codebook Rearrangement strategy that uses a balanced k-means clustering algorithm to rearrange the visual codebook into clusters, ensuring high similarity among the visual features within each cluster. Leveraging the rearranged codebook, we propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster in which the token is located. As a result, even if the model predicts the wrong token index, the predicted token is highly likely to lie in the correct cluster, which significantly enhances generation quality and robustness. Extensive experiments demonstrate that our method consistently improves training efficiency and performance for models from 100M to 1.4B parameters, halving the training time needed to reach the same FID. Additionally, our approach can be applied to various LLM-based visual generation models and adheres to the scaling law, providing a promising direction for future research in LLM-based visual generation. The code is available at: https://github.com/sjtuplayer/IAR.
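To make the Codebook Rearrangement step concrete, the following is a minimal pure-Python sketch of one simple way to obtain equal-sized clusters and then reorder a codebook so that each cluster occupies a contiguous block of indices. It is an illustration under stated assumptions, not the authors' algorithm: the greedy capacity-constrained assignment, the function names `balanced_kmeans` and `rearrange`, and the parameters `n_iters` and `seed` are all hypothetical choices made here.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def balanced_kmeans(codes, n_clusters, n_iters=10, seed=0):
    """Assign each code to one of n_clusters equal-sized clusters.

    Assumption: balance is enforced greedily. All (code, cluster) pairs
    are visited from closest to farthest, and a pair is accepted only if
    the code is still unassigned and the cluster is not yet full.
    """
    assert len(codes) % n_clusters == 0, "codebook must split evenly"
    capacity = len(codes) // n_clusters
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(codes, n_clusters)]
    assign = [-1] * len(codes)
    for _ in range(n_iters):
        pairs = sorted(
            (sq_dist(code, centers[k]), i, k)
            for i, code in enumerate(codes)
            for k in range(n_clusters)
        )
        assign = [-1] * len(codes)
        sizes = [0] * n_clusters
        for _, i, k in pairs:
            if assign[i] == -1 and sizes[k] < capacity:
                assign[i] = k
                sizes[k] += 1
        # Move each center to the mean of its (exactly `capacity`) members.
        for k in range(n_clusters):
            members = [codes[i] for i in range(len(codes)) if assign[i] == k]
            centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def rearrange(codes, assign):
    """Reorder the codebook so that cluster j occupies a contiguous index block."""
    order = sorted(range(len(codes)), key=lambda i: (assign[i], i))
    return [codes[i] for i in order], order


# Toy codebook of six 1-D codes, split into two clusters of three.
codes = [[0.0], [0.1], [10.0], [10.1], [0.2], [10.2]]
assign = balanced_kmeans(codes, n_clusters=2)
new_codes, order = rearrange(codes, assign)
```

After rearrangement, the cluster of a token can be read directly from its new index by integer division with the cluster size, which is what makes a cluster-level loss cheap to compute.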

Paper Structure (23 sections, 13 equations, 12 figures, 15 tables, 1 algorithm)

Figures (12)

  • Figure 1: When an autoregressive model predicts a wrong token, previous methods such as LlamaGen and VAR may predict an irrelevant token that causes artifacts. Our IAR alleviates this issue by making it highly probable that the predicted token lies in the correct cluster.
  • Figure 2: (a) The MSE and LPIPS between the source image and the decoded image at different code distances. (b) Visualization of the images decoded at different code distances. When the code distance is within a certain range (e.g., smaller than 12), the decoded image looks nearly identical to the source image. We use this property to improve the LLM-based visual generation model.
  • Figure 3: Model framework: 1) Codebook Rearrangement: we first use a balanced K-means clustering method to rearrange the codebook, dividing it into $n$ clusters such that the image codes in each cluster are highly similar. 2) Cluster-oriented Constraint: during training, we first quantize the image patches using the rearranged codebook. From the output probability distribution $\hat{Y}$, we compute the cluster-level distribution $\hat{Y}_C$ by applying a LogSumExp operation to the probabilities in each cluster, $\hat{Y}_{jm}, \dots, \hat{Y}_{(j+1)m-1}$. We then compute the cluster-oriented cross-entropy loss $\mathcal{L}_{CCE}$ in addition to the token-oriented cross-entropy loss $\mathcal{L}_{TCE}$, which makes it highly probable that the predicted token lies in the correct cluster, thereby enhancing generation quality.
  • Figure 4: (a) Model performance (IAR-B and IAR-L) across different CFG scales; (b) model performance at different parameter counts (111M to 3B) compared to LlamaGen; and (c) model performance at different training epochs compared to LlamaGen.
  • Figure : Balanced k-means Clustering
  • ...and 7 more figures
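
The cluster-oriented constraint described in the Figure 3 caption can be sketched as follows. This is a minimal pure-Python illustration, assuming the codebook has already been rearranged so that cluster $j$ holds token indices $jm$ through $(j+1)m-1$, and assuming the two losses are combined as a weighted sum. The function names and the weight `lam` are hypothetical and are not taken from the paper or its repository.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against a single target index."""
    return log_sum_exp(logits) - logits[target]


def cluster_logits(logits, cluster_size):
    """Collapse token logits into one logit per cluster via LogSumExp.

    The softmax over these cluster logits equals the sum of the token
    probabilities inside each cluster.
    """
    assert len(logits) % cluster_size == 0, "vocabulary must split evenly"
    return [
        log_sum_exp(logits[start:start + cluster_size])
        for start in range(0, len(logits), cluster_size)
    ]


def iar_style_loss(logits, target, cluster_size, lam=1.0):
    """Token-level CE plus cluster-level CE (weight `lam` is an assumption)."""
    l_tce = cross_entropy(logits, target)
    target_cluster = target // cluster_size  # contiguous clusters after rearrangement
    l_cce = cross_entropy(cluster_logits(logits, cluster_size), target_cluster)
    return l_tce + lam * l_cce, l_tce, l_cce


# Toy vocabulary of 4 tokens in 2 clusters of size m = 2.
logits = [2.0, 1.0, 0.0, 0.0]
total, l_tce, l_cce = iar_style_loss(logits, target=1, cluster_size=2)
```

Because the probability of a cluster is never smaller than the probability of any single token inside it, the cluster-level term is never larger than the token-level term. It adds gradient that pushes probability mass toward the whole target cluster, so a wrong token prediction is more likely to be a visually similar one.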