VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling

Sicheng Yang; Xing Hu; Qiang Wu; Dawei Yang

TL;DR

VAEVQ tackles key limitations of discrete visual tokenizers by integrating a variational latent space with vector quantization. The framework combines Variational Latent Quantization (VLQ), Representation Coherence Strategy (RCS), and Distribution Consistency Regularization (DCR) to achieve smoother latent manifolds, stronger local alignment, and globally balanced codebook distributions. Empirical results on ImageNet and BraTS24 show superior reconstruction and generation quality, along with near-complete codeword utilization and robustness across domains, without relying on pretrained models. These advances enhance the practicality and expressiveness of discrete visual tokens for autoregressive and diffusion-based image generation tasks.

Abstract

Vector quantization (VQ) transforms continuous image features into discrete representations, providing compressed, tokenized inputs for generative models. However, VQ-based frameworks suffer from several issues: non-smooth latent spaces, weak alignment between representations before and after quantization, and poor coherence between the continuous and discrete domains. These issues lead to unstable codeword learning and underutilized codebooks, ultimately degrading both reconstruction and downstream generation. To address them, we propose VAEVQ, which comprises three key components: (1) Variational Latent Quantization (VLQ), which replaces the AE with a VAE for quantization to leverage its structured and smooth latent space, thereby facilitating more effective codeword activation; (2) Representation Coherence Strategy (RCS), which adaptively modulates the alignment strength between pre- and post-quantization features to enhance consistency and prevent overfitting to noise; and (3) Distribution Consistency Regularization (DCR), which aligns the entire codebook distribution with the continuous latent distribution to improve utilization. Extensive experiments on two benchmark datasets (ImageNet and BraTS24) demonstrate that VAEVQ outperforms state-of-the-art methods.
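
To make the VLQ idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard VAE reparameterization (z_c = mu + sigma * eps) followed by a nearest-codeword lookup under squared L2 distance. The function names, the toy codebook, and the example values are all hypothetical and chosen only for illustration.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Sample z_c = mu + sigma * eps with eps ~ N(0, I) (standard VAE trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def quantize(z_c, codebook):
    """Return (index, codeword) of the codeword nearest to z_c (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda k: sq_dist(z_c, codebook[k]))
    return idx, codebook[idx]


# Toy example (hypothetical values): a 3-entry codebook in 2 dimensions.
rng = random.Random(0)
codebook = [[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]]

# Assumed encoder outputs; the very small variance keeps z_c close to mu.
mu, logvar = [0.9, 1.1], [-20.0, -20.0]

z_c = reparameterize(mu, logvar, rng)  # continuous variational latent
idx, z_q = quantize(z_c, codebook)     # discrete token and its codeword
```

With a larger variance, repeated samples of z_c spread over a wider region and can select different codewords, which is one intuitive reading of why a smoother latent space may activate more of the codebook.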

Paper Structure (26 sections, 15 equations, 9 figures, 3 tables)

This paper contains 26 sections, 15 equations, 9 figures, 3 tables.

Introduction
Related Work
Discrete Visual Tokenizers
Visual Tokenizers for Image Generation
Methodology
Overview
Variational Latent Quantization (VLQ)
Representation Coherence Strategy (RCS)
Distribution Consistency Regularization (DCR)
Training Objective
Experiments
Datasets and Implementation Details
Datasets.
Implementation Details.
Visual Reconstruction Performance
...and 11 more sections

Figures (9)

Figure 1: Comparison of different VQ strategies. (a) Direct quantization over AE latents often leads to codebook collapse, as the latent space of AE is typically irregular and fragmented, making it suboptimal for quantization. (b) VLQ introduces variational modeling to smooth the transition between pre- and post-quantization representations, enabling more effective codeword activation and updating. (c) The complete VAEVQ framework, augmented with RCS and DCR, achieves high efficiency (i.e., without pretrained models such as DINO) and high codebook utilization.
Figure 2: Overview of the proposed VAEVQ framework. The VLQ module encodes the input into a variational latent vector $z_c$ and quantizes it into $z_q$, followed by dual-path decoding to enforce consistency. RCS imposes a variance-aware alignment between $z_c$ and $z_q$ to preserve confident features while tolerating uncertainty. DCR aligns the codebook distribution with the VAE prior via optimal transport. Through the joint effect of these modules, the codebook is progressively updated during training, leading to improved utilization and higher-quality visual tokens.
Figure 3: Comparison between vanilla vector quantization and our proposed Variational Latent Quantization (VLQ). (a) In vanilla VQ, latent features from the autoencoder (AE) latent space are sparse and rigid, causing most initial codewords (orange) to remain unused. As a result, many codewords become inactive (red), and only a few (green) are eventually trained, leading to low codebook utilization. (b) In VLQ, latent vectors are drawn from the VAE latent space, which has a smoother distribution. This enables more codewords to be activated and gradually updated.
Figure 4: Conceptual illustration of the progressive alignment among the VQ space, continuous latent space (VAE), and the prior distribution. (a) VQ and VAE are partially aligned, but both remain misaligned with the prior. (b) RCS encourages instance-level alignment between VQ and VAE, reducing their local discrepancies. However, some regions of the latent space remain unaligned. (c) DCR regularizes the codebook distribution to match the Gaussian prior, yielding a diverse and well-structured codebook whose space is aligned with both the VAE latent space and the prior.
Figure 5: Codebook utilization rates (%) of different tokenizers on ImageNet and BraTS24. VAEVQ achieves significantly higher utilization across both datasets, indicating more effective and diverse token usage.
...and 4 more figures
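
The captions for Figures 2 and 4 describe RCS as a variance-aware alignment between $z_c$ and $z_q$, and DCR as matching the codebook distribution to the Gaussian prior via optimal transport. The exact loss forms are not given in this excerpt, so the sketch below rests on two assumptions of its own: RCS is approximated by a squared error weighted by the inverse predicted variance, and DCR is approximated by the closed-form squared 2-Wasserstein distance between a diagonal Gaussian fitted to the codewords and N(0, I). Both function names are hypothetical.

```python
import math


def rcs_alignment_loss(z_c, z_q, logvar):
    """Assumed RCS stand-in: squared error weighted by exp(-logvar).

    Low-variance (confident) dimensions are penalized more strongly;
    high-variance (uncertain) dimensions are tolerated.
    """
    terms = [math.exp(-lv) * (c - q) ** 2
             for c, q, lv in zip(z_c, z_q, logvar)]
    return sum(terms) / len(terms)


def gaussian_w2_to_standard_normal(codebook):
    """Assumed DCR stand-in: squared 2-Wasserstein distance between a
    diagonal Gaussian fitted to the codewords and the prior N(0, I).

    For diagonal Gaussians this is sum_d (mean_d^2 + (std_d - 1)^2).
    """
    n = len(codebook)
    dim = len(codebook[0])
    total = 0.0
    for d in range(dim):
        column = [codeword[d] for codeword in codebook]
        mean = sum(column) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
        total += mean ** 2 + (std - 1.0) ** 2
    return total


# Same residual, different predicted uncertainty (hypothetical values).
confident = rcs_alignment_loss([1.0], [0.0], [0.0])  # variance 1
uncertain = rcs_alignment_loss([1.0], [0.0], [2.0])  # variance e^2

# A spread-out codebook versus a collapsed one (hypothetical values).
spread = gaussian_w2_to_standard_normal([[-1.0], [1.0]])
collapsed = gaussian_w2_to_standard_normal([[0.5], [0.5]])
```

In this toy setting the uncertain latent incurs a smaller alignment penalty than the confident one, and the collapsed codebook sits farther from the prior than the spread-out one, which mirrors the qualitative behavior the captions describe.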
