Mitigating Premature Discretization with Progressive Quantization for Robust Vector Tokenization

Wenhao Zhao, Qiran Zou, Zhouhan Lin, Dianbo Liu

Abstract

Vector Quantization (VQ) has become the cornerstone of tokenization for many multimodal Large Language Models and for diffusion-based synthesis. However, existing VQ paradigms suffer from a fundamental conflict: they enforce discretization before the encoder has captured the underlying data manifold. We term this phenomenon Premature Discretization. To resolve it, we propose Progressive Quantization (ProVQ), which incorporates the dynamics of quantization hardness as a fundamental yet previously overlooked axis in VQ training. By treating quantization as a curriculum that smoothly anneals from a continuous latent space to a discrete one, ProVQ guides the codebook toward well-expanded manifolds. Extensive experiments demonstrate the effectiveness of ProVQ across diverse modalities. We report improved reconstruction and generative performance on the ImageNet-1K and ImageNet-100 benchmarks, highlighting ProVQ's benefit for generative modeling. Furthermore, ProVQ proves highly effective for modeling complex biological sequences, establishing a new performance ceiling for protein structure tokenization on the StructTokenBench leaderboard.

Paper Structure (22 sections, 4 equations, 3 figures, 6 tables)

Figures (3)

  • Figure 1: Premature Discretization and the resulting optimization deadlock. During early training stages, grid mapping forces the embedding distribution to contract and align with a sub-optimal, clustered codebook, while uninformative guidance from the embeddings causes the codebook vectors to stagnate. This mutual constraint creates a rigid optimization deadlock, which traps the model in a local minimum and prevents it from exploring the full, well-distributed latent manifold (right).
  • Figure 2: Empirical validation on synthetic 2D datasets. (a) Synthetic dataset composed of disk-shaped data plus triangle data; the triangle's edges make the grid mapping visible. (b) Comparison of reconstruction performance across configurations, demonstrating that both the Soft Transition and the full ProVQ (Soft Transition + Manifold Warmup) strategies consistently outperform the Vanilla VQ baseline. (c) Reconstruction improvement relative to Vanilla VQ. While Soft Transition alone yields substantial gains ($+11.9\%$ for Disk and $+30.7\%$ for Triangle), adding a Manifold Warmup further boosts performance, achieving a $+33.1\%$ improvement on the Triangle dataset. These results underscore the benefit of decoupling continuous and discrete learning at the early stage.
  • Figure 3: Comparison of Embedding and Codebook Dynamics during Training. (a) Vanilla VQ: Inward-curved embedding edges signify grid mapping and an optimization deadlock, preventing full manifold coverage. (b) Soft Transition: Relaxes initial constraints to partially mitigate embedding shrinkage and improve codebook migration. (c) ProVQ (Ours): Manifold warm-up followed by soft transition achieves precise topological alignment, effectively resolving the deadlock.
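
The two-phase curriculum described in the abstract and in the Figure 3 caption (a manifold warm-up followed by a soft transition) can be illustrated with a minimal sketch. It is not the authors' implementation. The source states only that quantization anneals smoothly from a continuous latent space to a discrete one. The linear blend between the embedding and its nearest code, the cosine ramp, and the names `hardness`, `nearest_code`, `progressive_quantize`, `warmup_steps`, and `anneal_steps` are all assumptions made here for illustration.

```python
import math


def hardness(step, warmup_steps, anneal_steps):
    """Hypothetical quantization-hardness schedule (an assumption, not from the paper).

    Returns 0.0 during the warm-up phase (fully continuous latents), then
    follows a cosine ramp up to 1.0 (fully discrete latents).
    """
    if step < warmup_steps:
        return 0.0
    t = min(1.0, (step - warmup_steps) / anneal_steps)
    return 0.5 * (1.0 - math.cos(math.pi * t))


def nearest_code(z, codebook):
    """Return the codebook vector closest to z in squared Euclidean distance."""
    return min(codebook, key=lambda c: sum((zi - ci) ** 2 for zi, ci in zip(z, c)))


def progressive_quantize(z, codebook, alpha):
    """Blend the continuous embedding with its nearest code.

    alpha = 0 passes z through unchanged; alpha = 1 is standard hard VQ.
    The linear interpolation is an assumed form of the soft transition.
    """
    q = nearest_code(z, codebook)
    return [(1.0 - alpha) * zi + alpha * qi for zi, qi in zip(z, q)]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0]]  # toy 2D codebook
    z = [0.8, 0.6]                       # toy encoder output
    for step in (0, 100, 150, 200):
        alpha = hardness(step, warmup_steps=100, anneal_steps=100)
        print(step, round(alpha, 3), progressive_quantize(z, codebook, alpha))
```

In a real training loop the blended vector would be passed to the decoder, typically together with a straight-through gradient estimator and the usual codebook and commitment losses. Those parts are omitted here because the source text does not specify them.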