Early Quantization Shrinks Codebook: A Simple Fix for Diversity-Preserving Tokenization

Wenhao Zhao; Qiran Zou; Rushi Shah; Yudi Wu; Zhouhan Lin; Dianbo Liu

Abstract

Vector quantization is a machine learning technique that discretizes continuous representations into a set of discrete vectors. It is widely used to tokenize data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behavior of vector quantization in generative models remain largely underexplored. In this study, we systematically investigate collapse in vector quantization, where collapsed representations are observed across both discrete codebook tokens and continuous latent embeddings. Using both synthetic and real datasets, we characterize the severity of each type of collapse and the conditions that trigger it. Our analysis reveals that random initialization and limited encoder capacity result in token collapse and embedding collapse. Building on these findings, we propose potential solutions aimed at mitigating each type of collapse. To the best of our knowledge, this is the first comprehensive study of representation collapse in vector quantization.
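
As a minimal illustration of the discretization step described in the abstract, the sketch below maps each continuous vector to its nearest codebook entry under Euclidean distance. The function name `quantize`, the array shapes, and the toy values are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z (N, D) to its nearest row of codebook (K, D)."""
    # Squared Euclidean distance between every input and every codeword: shape (N, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)      # discrete token index for each input
    return idx, codebook[idx]    # token indices and their quantized vectors

# Hypothetical toy data: three codewords and three nearby continuous vectors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
z = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]])
tokens, z_q = quantize(z, codebook)
```

Here each input lies closest to a different codeword, so all three tokens are used. If the codewords were instead clustered in a narrow region, many distinct inputs would map to nearly identical quantized vectors.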

Paper Structure (42 sections, 7 equations, 6 figures, 11 tables)

This paper contains 42 sections, 7 equations, 6 figures, 11 tables.

Introduction
Related Works
Preliminary
Mode Collapse and Limited Diversity
Vector Quantization
VQ-VAE
Token Representation Shrinkage
Empirical Characterization
Definition.
Synthetic evidence: shrinkage induces reconstruction collapse.
Root cause for shrinkage: clustered initialization from an untrained encoder.
Intervention for controlled study.
Result: mitigating shrinkage improves modality coverage.
Theoretical Analysis
Setup.
...and 27 more sections

Figures (6)

Figure 1: Token representation shrinkage degrades the diversity of generative models. Vector quantization is a widely used technique for mapping continuous data into discrete tokens, which supports generation with transformer-based generative models. We observe that token representation shrinkage, manifested as a narrow distribution in latent space, leads to a shrunken distribution of generated data.
Figure 2: Token representation shrinkage shrinks latent support and induces reconstruction mode collapse while Deferred Quantization mitigates it. (a) w/o deferred quantization: tokens cluster into a narrow region of the embedding space, reducing coverage and causing reconstructions to collapse onto fewer modes across different input dimensions. (b) w/ deferred quantization: tokens spread across the embedding space, improving latent support coverage and yielding reconstructions with better modality coverage.
Figure 3: Early quantization with clustered initialization induces token representation shrinkage, while Deferred Quantization mitigates it. Initializing the codebook from an untrained encoder yields a narrow, uninformative embedding distribution, causing tokens to cluster and latent support to shrink at an early stage. In contrast, Deferred Quantization first learns a dispersed continuous representation and then initializes the codebook with semantic embeddings from the pretrained encoder, yielding better token coverage and reducing shrinkage.
Figure 4: Deferred Quantization alleviates token representation shrinkage on CIFAR-10. (a) Token representation shrinkage in standard VQ impairs the model's ability to scale, leading to stagnating MSE. Deferred Quantization mitigates this shrinkage, allowing the model to achieve better reconstruction performance as the number of tokens increases. (b) Without Deferred Quantization, perplexity remains low, indicating highly uneven token usage. Deferred Quantization resolves this by spreading representations across the embedding space, yielding high perplexity and efficient use of the available latent capacity. (c) The sharp drop in codebook Euclidean distance for standard VQ indicates that tokens are clustering into a narrow region (shrinkage). Deferred Quantization maintains a high distance between tokens, preserving codebook coverage.
Figure 5: Images generated using VAR. (a) ImageNet (a.left) and real-world medical images of eyes (a.right) generated by VAR w/o deferred quantization. (b) ImageNet (b.left) and real-world medical images of eyes (b.right) generated by VAR w/ deferred quantization.
...and 1 more figure
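
The two diagnostics named in the Figure 4 caption, token-usage perplexity and codebook Euclidean distance, can be sketched as follows. This is a rough sketch under assumptions: perplexity is taken as the exponential of the entropy of the empirical token distribution, and codebook distance as the mean pairwise Euclidean distance between codewords. The paper's exact definitions are not given in this summary, and the function names are hypothetical.

```python
import numpy as np

def usage_perplexity(tokens, num_codes):
    """exp(entropy) of empirical token usage; ranges from 1 to num_codes."""
    counts = np.bincount(tokens, minlength=num_codes).astype(float)
    p = counts / counts.sum()
    nz = p > 0                               # skip unused codes to avoid log(0)
    return float(np.exp(-(p[nz] * np.log(p[nz])).sum()))

def mean_pairwise_distance(codebook):
    """Mean Euclidean distance over all unordered pairs of codewords."""
    diff = codebook[:, None, :] - codebook[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(codebook.shape[0], k=1)   # each pair counted once
    return float(dist[upper].mean())

# Hypothetical usage: even token usage versus fully collapsed usage.
even = usage_perplexity(np.array([0, 1, 2, 3]), num_codes=4)
collapsed = usage_perplexity(np.zeros(8, dtype=int), num_codes=4)
```

Under these assumed definitions, perplexity equals the codebook size when all codes are used equally and falls to 1 when a single code absorbs every input, while a small mean pairwise distance indicates codewords packed into a narrow region.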
