ImageFolder: Autoregressive Image Generation with Folded Tokens

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin

TL;DR

This work addresses the tension between token length and quality in autoregressive image generation. It introduces ImageFolder, a semantic, two-branch product-quantized tokenizer that produces spatially aligned tokens and supports folding to shorten AR sequences without sacrificing reconstruction or generation performance. Key contributions include semantic regularization on one PQ branch, quantizer dropout to diversify residual scales, and parallel decoding that predicts two tokens per logit, enabling a shorter effective token length (e.g., 286) while maintaining or improving quality. Empirically, ImageFolder achieves competitive gFID/rFID metrics and favorable latency versus strong baselines, demonstrating efficient, high-quality AR image generation with folded tokens and opening avenues for tighter integration with LLMs and multimodal systems.
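
The parallel decoding mentioned above can be illustrated with a small NumPy sketch. It is only an approximation under stated assumptions: the summary does not say how two tokens are read out of one AR output, so the two linear heads (`head_s`, `head_d`), the greedy argmax decoding, and all sizes below are hypothetical choices made for illustration.

```python
import numpy as np

def parallel_predict(hidden, head_s, head_d):
    """Map each AR hidden state to one token per branch (greedy decoding).

    hidden: (T, W) array of hidden states, one per AR step.
    head_s, head_d: (W, V) projection matrices (hypothetical, one per branch).
    """
    logits_s = hidden @ head_s  # (T, V) scores for the first branch
    logits_d = hidden @ head_d  # (T, V) scores for the second branch
    # One AR step yields a pair of token indices instead of a single one.
    return np.stack([logits_s.argmax(axis=1), logits_d.argmax(axis=1)], axis=1)

rng = np.random.default_rng(0)
steps, width, vocab = 5, 16, 32  # illustrative sizes, not from the paper
hidden = rng.normal(size=(steps, width))
head_s = rng.normal(size=(width, vocab))
head_d = rng.normal(size=(width, vocab))

tokens = parallel_predict(hidden, head_s, head_d)
print(tokens.shape)  # (5, 2): ten tokens produced in five AR steps
```

With this layout the AR sequence has one position per spatial location, so emitting both branches costs half as many steps as predicting the same tokens one at a time.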

Abstract

Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling. Increasing token length is a common approach to improving image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality: there is a trade-off between reconstruction and generation quality with respect to token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to the trade-off. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality. To enhance the representational capability without increasing token length, we leverage dual-branch product quantization to capture different contexts of images. Specifically, semantic regularization is introduced in one branch to encourage compact semantic information, while the other branch is designed to capture the remaining pixel-level details. Extensive experiments demonstrate the superior image generation quality and shorter token length achieved with the ImageFolder tokenizer.
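
The dual-branch product quantization and token folding described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes plain nearest-neighbor vector quantization per branch, and the function names, codebook sizes, and feature dimensions are hypothetical. Semantic regularization and the encoder/decoder are omitted.

```python
import numpy as np

def quantize(features, codebook):
    """Nearest-codeword lookup.

    features: (N, D) array, codebook: (V, D) array.
    Returns the (N,) codeword indices and the (N, D) quantized vectors.
    """
    # Squared Euclidean distance between every feature and every codeword.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]

def product_quantize(feat_s, feat_d, codebook_s, codebook_d):
    """Quantize two spatially aligned branches with separate codebooks."""
    idx_s, q_s = quantize(feat_s, codebook_s)
    idx_d, q_d = quantize(feat_d, codebook_d)
    # The decoder input is the channel-wise concatenation of both branches.
    quantized = np.concatenate([q_s, q_d], axis=1)
    # Folding: keep one sequence position per spatial location, holding a
    # pair of indices, instead of unrolling both branches to 2 * N positions.
    folded = np.stack([idx_s, idx_d], axis=1)
    return folded, quantized

rng = np.random.default_rng(0)
num_tokens, dim, vocab = 16, 4, 32  # illustrative sizes, not from the paper
feat_s = rng.normal(size=(num_tokens, dim))  # stand-in for the semantic branch
feat_d = rng.normal(size=(num_tokens, dim))  # stand-in for the detail branch
codebook_s = rng.normal(size=(vocab, dim))
codebook_d = rng.normal(size=(vocab, dim))

folded, quantized = product_quantize(feat_s, feat_d, codebook_s, codebook_d)
print(folded.shape)     # (16, 2): 16 AR positions carrying 32 token indices
print(quantized.shape)  # (16, 8): concatenated decoder input
```

Because the two branches are spatially aligned, row `i` of `folded` describes the same image location in both codebooks, which is what makes it possible to treat the pair as a single AR position.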

Paper Structure

This paper contains 40 sections, 9 equations, 12 figures, 11 tables.

Figures (12)

  • Figure 1: Illustration of the ImageFolder tokenizer and its corresponding autoregressive (AR) modeling with parallel prediction. (a) ImageFolder uses product quantization to obtain two sets of spatially aligned tokens that capture distinct aspects of images. (b) With the tokens from ImageFolder, AR models can predict two tokens from one logit, which significantly shortens the sequence length and benefits performance.
  • Figure 2: Token dependency.
  • Figure 3: Overview of ImageFolder. ImageFolder leverages vision transformers (Dosovitskiy et al., 2020) to encode and decode images. Given an image, two sets of $K\times K$ learnable tokens are used to generate spatially aligned low-resolution features from the image. Product quantization is then used to obtain a discrete image representation. Semantic regularization is applied to one of the quantizers to inject semantic constraints. The quantized tokens are concatenated and serve as input to the image decoder, which reconstructs the image.
  • Figure 4: Illustration of quantizer dropout in the multi-scale residual quantizer.
  • Figure 5: Visualization of zeroing out $z^\prime_s$ or $z^\prime_d$.
  • ...and 7 more figures
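
The quantizer dropout named in the Figure 4 caption can be illustrated with a simplified sketch. The captions above do not spell out the mechanism, so the following assumes that dropout means randomly keeping only a prefix of the residual stages during training. It also uses a flat, single-resolution residual quantizer rather than a multi-scale one, and every name and size is hypothetical.

```python
import numpy as np

def residual_quantize(features, codebooks, num_active):
    """Quantize with only the first `num_active` residual stages.

    features: (N, D) array; codebooks: list of (V, D) arrays, one per stage.
    """
    residual = features.copy()
    quantized = np.zeros_like(features)
    for codebook in codebooks[:num_active]:
        # Pick the codeword nearest to what is still unexplained.
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        chosen = codebook[dists.argmin(axis=1)]
        quantized += chosen
        residual -= chosen
    return quantized

def sample_num_active(num_stages, min_stages, rng):
    """Assumed dropout rule: keep a random prefix of the stages."""
    # Generator.integers excludes the upper bound, hence the + 1.
    return int(rng.integers(min_stages, num_stages + 1))

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 4))
# Four stages with shrinking codeword magnitudes (illustrative only).
codebooks = [rng.normal(size=(32, 4)) * 0.5 ** i for i in range(4)]

num_active = sample_num_active(len(codebooks), min_stages=1, rng=rng)
out = residual_quantize(features, codebooks, num_active)
print(out.shape)  # (16, 4)
```

In a training loop, `sample_num_active` would be called once per step, so the downstream decoder sees reconstructions built from varying numbers of residual stages.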