GloTok: Global Perspective Tokenizer for Image Reconstruction and Generation

Xuan Zhao, Zhongyu Zhang, Yuge Huang, Yuxi Mi, Guodong Mu, Shouhong Ding, Jun Wang, Rizen Guo, Shuigeng Zhou

TL;DR

This work introduces GloTok, a global-relational image tokenizer with a dual-codebook design that enforces a uniform semantic latent distribution via histogram relation learning. By transferring global relations from pre-trained models into a semantic codebook and applying residual refinement, GloTok improves image reconstruction and enables high-quality autoregressive generation without requiring direct access to pre-trained models during training. On ImageNet-1k (256×256), it achieves state-of-the-art reconstruction FID and competitive generation performance, with ablations confirming the benefits of global relational supervision and residuals. The approach reduces training complexity and GPU memory use while improving latent-space uniformity and downstream synthesis quality.

Abstract

Existing state-of-the-art image tokenization methods leverage diverse semantic features from pre-trained vision models for additional supervision, to expand the distribution of latent representations and thereby improve the quality of image reconstruction and generation. These methods employ a locally supervised approach for semantic supervision, which limits the uniformity of semantic distribution. However, VA-VAE proves that a more uniform feature distribution yields better generation performance. In this work, we introduce a Global Perspective Tokenizer (GloTok), which utilizes global relational information to model a more uniform semantic distribution of tokenized features. Specifically, a codebook-wise histogram relation learning method is proposed to transfer the semantics, which are modeled by pre-trained models on the entire dataset, to the semantic codebook. Then, we design a residual learning module that recovers the fine-grained details to minimize the reconstruction error caused by quantization. Through the above design, GloTok delivers more uniformly distributed semantic latent representations, which facilitates the training of autoregressive (AR) models for generating high-quality images without requiring direct access to pre-trained models during the training process. Experiments on the standard ImageNet-1k benchmark clearly show that our proposed method achieves state-of-the-art reconstruction performance and generation quality.
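
To make "the reconstruction error caused by quantization" concrete, here is a minimal NumPy sketch of nearest-neighbor vector quantization and the residual it leaves behind. It is an illustration only, not GloTok's implementation: the function name `quantize`, the array shapes, and the random toy data are all assumptions, and the abstract does not describe how the residual module is built. In GloTok the residual is learned by a module, whereas this sketch computes it exactly.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (Euclidean distance).

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Squared distances between every feature (N, D) and every code (K, D) -> (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
features = rng.normal(size=(16, 8))  # 16 continuous encoder features of dimension 8 (toy data)
codebook = rng.normal(size=(32, 8))  # 32 code vectors (toy codebook)

indices, quantized = quantize(features, codebook)

# The residual is the fine-grained detail that quantization discards.
residual = features - quantized

# Adding the exact residual back recovers the continuous features.
assert np.allclose(quantized + residual, features)
```

A learned residual module can only approximate `residual`, so the better that approximation is, the smaller the reconstruction error left after quantization.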

Paper Structure

This paper contains 34 sections, 13 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of generative performance among different methods and GloTok, where lower values on the Y-axis correspond to better performance.
  • Figure 2: Illustration of our method. Top: GloTok encoder-quantizer-decoder architecture with dual codebooks and residual modules. Bottom: overview of the Histogram Relation Learning method. An image is quantized into two sets of tokens by a visual codebook and a semantic codebook. The semantic codebook learns the token relationship from features clustered from a pre-trained model with a histogram loss. GloTok adopts two residual modules to learn the residuals between continuous features and discrete features.
  • Figure 3: ImageNet-1k 256$\times$256 generated samples of GloTok trained with xAR.
  • Figure 4: Visualization of reconstructions at different retention ratios of visual features. Each row presents results for visual features at different retention ratios, both with and without semantic features.
  • Figure 5: Framework comparison between state-of-the-art methods and GloTok.
  • ...and 2 more figures