LGQ: Learning Discretization Geometry for Scalable and Stable Image Tokenization

Idil Bilge Altun, Mert Onur Cakiroglu, Elham Buxton, Mehmet Dalkilic, Hasan Kurban

TL;DR

Learnable Geometric Quantization (LGQ) is a discrete image tokenizer that learns discretization geometry end-to-end and achieves stable optimization and balanced utilization under a controlled VQGAN-style backbone on ImageNet.

Abstract

Discrete image tokenization is a key bottleneck for scalable visual generation: a tokenizer must remain compact for efficient latent-space priors while preserving semantic structure and using discrete capacity effectively. Existing quantizers face a trade-off: vector-quantized tokenizers learn flexible geometries but often suffer from biased straight-through optimization, codebook under-utilization, and representation collapse at large vocabularies. Structured scalar or implicit tokenizers ensure stable, near-complete utilization by design, yet rely on fixed discretization geometries that may allocate capacity inefficiently under heterogeneous latent statistics. We introduce Learnable Geometric Quantization (LGQ), a discrete image tokenizer that learns discretization geometry end-to-end. LGQ replaces hard nearest-neighbor lookup with temperature-controlled soft assignments, enabling fully differentiable training while recovering hard assignments at inference. The assignments correspond to posterior responsibilities of an isotropic Gaussian mixture and minimize a variational free-energy objective, provably converging to nearest-neighbor quantization in the low-temperature limit. LGQ combines a token-level peakedness regularizer with a global usage regularizer to encourage confident yet balanced code utilization without imposing rigid grids. Under a controlled VQGAN-style backbone on ImageNet across multiple vocabulary sizes, LGQ achieves stable optimization and balanced utilization. At 16K codebook size, LGQ improves rFID by 11.88% over FSQ while using 49.96% fewer active codes, and improves rFID by 6.06% over SimVQ with 49.45% lower effective representation rate, achieving comparable fidelity with substantially fewer active entries. Our GitHub repository is available at: https://github.com/KurbanIntelligenceLab/LGQ
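The soft-to-hard mechanism described in the abstract can be illustrated with a minimal NumPy sketch. It assumes the assignment takes the common form of a softmax over negative squared distances divided by a temperature `tau`; the exact parameterization in LGQ may differ, and the function names, shapes, and random data below are hypothetical.

```python
import numpy as np

def squared_distances(z, codebook):
    # z: (N, D) latent tokens, codebook: (K, D) codewords -> (N, K) distances
    diff = z[:, None, :] - codebook[None, :, :]
    return (diff ** 2).sum(axis=-1)

def soft_assign(z, codebook, tau):
    # Assumed form: p(k | z) proportional to exp(-||z - c_k||^2 / tau)
    logits = -squared_distances(z, codebook) / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def hard_assign(z, codebook):
    # Nearest-neighbor lookup, as used at inference time
    return squared_distances(z, codebook).argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # hypothetical K=8 codewords, D=4
z = rng.normal(size=(5, 4))         # hypothetical batch of 5 latent tokens

for tau in (1.0, 0.1, 1e-6):
    p = soft_assign(z, codebook, tau)
    # The largest responsibility per token grows toward 1 as tau shrinks
    print(tau, p.max(axis=1).round(3))
```

Because the softmax is monotone in the distances, the most probable code under the soft assignment is always the nearest codeword; lowering `tau` only concentrates more mass on it.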

Paper Structure (23 sections, 4 theorems, 14 equations, 8 figures, 5 tables)

This paper contains 23 sections, 4 theorems, 14 equations, 8 figures, 5 tables.

Key Result

Theorem 3.1

Suppose $c_1,\dots,c_K$ are pairwise distinct. For fixed $z$, the soft assignment $p(k\,|\,z)$ defined in eq:gibbs_posterior2 converges to a one-hot distribution as $\tau \to 0$.
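For orientation, the standard zero-temperature limit of a distance-based softmax is sketched below. This is not the paper's own statement: it assumes the posterior has the form $p(k\,|\,z) \propto \exp(-\|z - c_k\|^2/\tau)$ and that the nearest codeword to $z$ is unique. The symbols $d_k$ and $k^\star$ are introduced here for illustration only.

```latex
% Assumed form of the soft assignment, with d_k := \|z - c_k\|^2
p(k \mid z) = \frac{\exp(-d_k/\tau)}{\sum_{j=1}^{K} \exp(-d_j/\tau)},
\qquad k^\star := \arg\min_{j} d_j .

% Multiply numerator and denominator by \exp(d_{k^\star}/\tau):
p(k \mid z) = \frac{\exp\!\big(-(d_k - d_{k^\star})/\tau\big)}
                   {1 + \sum_{j \neq k^\star} \exp\!\big(-(d_j - d_{k^\star})/\tau\big)} .

% Every gap d_j - d_{k^\star} with j \neq k^\star is strictly positive,
% so each exponential in the sum vanishes as \tau \to 0:
\lim_{\tau \to 0} p(k \mid z) =
\begin{cases}
  1, & k = k^\star, \\
  0, & k \neq k^\star .
\end{cases}
```

Under these assumptions the limit is exactly nearest-neighbor quantization, which matches the soft-to-hard behavior named in the theorem title.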

Figures (8)

  • Figure 1: Discretization geometries in latent tokenizers. (Left:) VQ maps $z$ to the nearest learned codeword $\hat{z}$ (Voronoi partitions). (Middle:) FSQ quantizes each dimension with fixed axis-aligned bins (implicit lattice). (Right:) LGQ (ours) uses temperature-controlled soft assignments $p_{t,k}$ with straight-through hard selection, learning discretization geometry end-to-end for smoother optimization and balanced utilization.
  • Figure 2: LGQ-VAE pipeline. The encoder $f_\theta$ maps an input image $x$ to a continuous latent representation $z_e$, which is discretized using Learnable Geometric Quantization (LGQ) before reconstruction by the decoder $g_\phi$. Each latent token is softly assigned to a shared learnable codebook via temperature-controlled distance-based probabilities, inducing a continuous assignment geometry over codebook entries. During training, soft assignments enable end-to-end optimization of both encoder and codebook geometry; as the temperature is annealed, assignments gradually sharpen and converge to hard quantization. The bottom panel illustrates how active codebook entries adapt to the latent distribution over training iterations, with inactive (“dead”) bins remaining unused. This soft-to-hard discretization mechanism allows LGQ to learn an efficient, data-aligned quantization structure, avoiding fixed grids and hard nearest-neighbor assignments used in standard VQ-VAEs.
  • Figure 3: Training dynamics of reconstruction quality metrics. Per-epoch comparison of models at $128 \times 128$ resolution over 61 training epochs. Metrics include reconstruction Fréchet Inception Distance (rFID, $\downarrow$), peak signal-to-noise ratio (PSNR, $\uparrow$), structural similarity index (SSIM, $\uparrow$), and learned perceptual image patch similarity (LPIPS, $\downarrow$).
  • Figure 4: Visualization of encoder outputs (blue) and active codebook entries (red) using UMAP. Blue contours represent the density of encoder outputs, while red points denote active codebook entries. The latent distributions differ across methods as each quantizer is trained end-to-end with its own encoder; the relevant comparison is the alignment between codes (red) and latent density (blue) within each plot, not the shape of the latent space across plots. SimVQ achieves dense coverage of the latent manifold by activating nearly the entire codebook, whereas LGQ attains comparable coverage using substantially fewer active codes. This illustrates that LGQ allocates codebook capacity more selectively, supporting similar reconstruction quality at lower effective representation rates.
  • Figure 5: Evolution of discretization bin centers during training. Right: Trajectories of a subset of individual bin centers over training epochs, highlighting structured and heterogeneous adaptation. Left: Distribution of bin center values at initialization and at the final epoch, showing a learned, non-trivial reshaping of the codebook.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Theorem 3.1: Soft–to–hard convergence
  • Proposition 3.2: Lipschitz continuity
  • Proposition A.1: Peaked assignments
  • Proposition A.2: Balanced codebook utilization
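
The two propositions on peaked assignments and balanced utilization correspond to the token-level and global regularizers mentioned in the abstract. Their exact definitions are not given in this summary, so the sketch below uses entropy-based stand-ins as an assumption: mean per-token entropy for peakedness, and the entropy gap of the batch-averaged usage for balance. The function names and the `eps` constant are hypothetical.

```python
import numpy as np

def peakedness_loss(p, eps=1e-12):
    # p: (N, K) soft assignments, each row sums to 1.
    # Mean per-token entropy: 0 for one-hot rows, log(K) for uniform rows.
    return float(-(p * np.log(p + eps)).sum(axis=1).mean())

def usage_loss(p, eps=1e-12):
    # Entropy gap of the average code usage over the batch:
    # 0 when all K codes are used equally, log(K) when one code takes everything.
    usage = p.mean(axis=0)
    num_codes = p.shape[1]
    return float(np.log(num_codes) + (usage * np.log(usage + eps)).sum())

K = 4
confident_balanced = np.eye(K)              # each token picks a different code
uncertain = np.full((K, K), 1.0 / K)        # every token is maximally unsure
collapsed = np.zeros((K, K))
collapsed[:, 0] = 1.0                       # every token picks code 0

for name, p in [("confident+balanced", confident_balanced),
                ("uncertain", uncertain),
                ("collapsed", collapsed)]:
    print(name, round(peakedness_loss(p), 4), round(usage_loss(p), 4))
```

Only the confident and balanced case drives both terms to zero. The uncertain case is penalized by the first term and the collapsed case by the second, which is why the two regularizers are combined in this sketch.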