Beyond Stationarity: Rethinking Codebook Collapse in Vector Quantization

Hao Lu, Onur C. Koyun, Yongxin Guo, Zhengjie Zhu, Abbas Alili, Metin Nafi Gurcan

TL;DR

Two new methods are proposed: Non-Stationary Vector Quantization (NSVQ), which propagates encoder drift to non-selected codes through a kernel-based rule, and Transformer-based Vector Quantization (TransVQ), which employs a lightweight mapping to adaptively transform the entire codebook while preserving convergence to the k-means solution.

Abstract

Vector Quantization (VQ) underpins many modern generative frameworks such as VQ-VAE, VQ-GAN, and latent diffusion models. Yet, it suffers from the persistent problem of codebook collapse, where a large fraction of code vectors remains unused during training. This work provides a new theoretical explanation by identifying the nonstationary nature of encoder updates as the fundamental cause of this phenomenon. We show that as the encoder drifts, unselected code vectors fail to receive updates and gradually become inactive. To address this, we propose two new methods: Non-Stationary Vector Quantization (NSVQ), which propagates encoder drift to non-selected codes through a kernel-based rule, and Transformer-based Vector Quantization (TransVQ), which employs a lightweight mapping to adaptively transform the entire codebook while preserving convergence to the k-means solution. Experiments on the CelebA-HQ dataset demonstrate that both methods achieve near-complete codebook utilization and superior reconstruction quality compared to baseline VQ variants, providing a principled and scalable foundation for future VQ-based generative models. The code is available at: https://github.com/CAIR-LAB-WFUSM/NSVQ-TransVQ.git
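
The collapse mechanism described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's implementation: it assumes a plain nearest-neighbor assignment with a k-means-style update, and the 2-D setup, the drift schedule, and the names `vq_step` and `simulate` are hypothetical choices made only for illustration. It shows that when the encoder output drifts, code vectors that are never selected receive no update and stay where they were initialized.

```python
import random


def nearest(code_vectors, x):
    """Index of the code vector closest to x (squared Euclidean distance)."""
    return min(
        range(len(code_vectors)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(code_vectors[k], x)),
    )


def vq_step(code_vectors, batch, lr=0.5):
    """One vanilla VQ update (assumed k-means-style rule, for illustration).

    Each selected code moves toward the mean of the points assigned to it.
    Codes that no point selects are left untouched.
    Returns the set of indices that were selected in this batch.
    """
    assigned = {}
    for x in batch:
        assigned.setdefault(nearest(code_vectors, x), []).append(x)
    for k, points in assigned.items():
        mean = [sum(col) / len(points) for col in zip(*points)]
        code_vectors[k] = [c + lr * (m - c) for c, m in zip(code_vectors[k], mean)]
    return set(assigned)


def simulate(steps=50, drift=0.2, seed=0):
    """Drift a toy 'encoder output' cloud and track which codes follow it."""
    rng = random.Random(seed)
    # Hypothetical setup: two codes start near the data, two start far away.
    code_vectors = [[0.0, 0.0], [0.5, 0.5], [-8.0, -8.0], [-9.0, -7.0]]
    initial = [list(c) for c in code_vectors]
    used_last = set()
    for t in range(steps):
        center = drift * t  # the data cloud moves along the diagonal
        batch = [
            [center + rng.gauss(0, 0.3), center + rng.gauss(0, 0.3)]
            for _ in range(32)
        ]
        used_last = vq_step(code_vectors, batch)
    # Utilization: fraction of codes selected at least once in the last batch.
    utilization = len(used_last) / len(code_vectors)
    return initial, code_vectors, utilization


if __name__ == "__main__":
    initial, final, utilization = simulate()
    for k, (before, after) in enumerate(zip(initial, final)):
        print(k, before, "->", [round(v, 2) for v in after])
    print("utilization in last batch:", utilization)
```

In this toy run the two distant codes are never the nearest neighbor of any sample, so they are never moved, while the codes near the data track the drift. The paper's two methods are described as addressing this by also updating non-selected codes (NSVQ) or by transforming the entire codebook jointly (TransVQ); neither is implemented in this sketch.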

Paper Structure (47 sections, 64 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Illustration of codebook adaptation under non-stationary data. (a) In vanilla VQ, codewords fail to track the drifting data distribution, leading to representation collapse. (b) In the proposed NS-VQ, adaptive variance-controlled updates allow the codebook to follow distributional shifts over time, maintaining coverage and stability. (c) In the proposed TransVQ, a few codewords still lag behind the drifting data distribution, but the projector gradients drive all other codewords to move jointly toward the data, improving overall alignment. In the figures, purple dots denote the target distribution $Y_t$ (least visible), green dots represent the base data $X$, blue dots indicate the current batch $X_b$, and red crosses mark the codebook vectors $C$.
  • Figure 2: Scheme of the proposed (a) Non-Stationary Vector Quantization (NS-VQ) and (b) the Transformer-based Vector Quantization (TransVQ).
  • Figure 3: Comparison of the proposed NS-VQVAE and TransVQVAE with VQGAN-FC RN3 under varying codebook sizes. (a) rFID comparison showing both NS-VQVAE and TransVQVAE consistently reduce reconstruction error compared with VQGAN-FC RN3. (b) Codebook utilization indicating that both proposed methods maintain nearly full codebook usage, effectively preventing codebook collapse.
  • Figure 4: rFID curves of standard VQ-VAE with codebook size and code dimension fixed at 64, evaluated under different batch sizes. Larger batch sizes lead to lower rFID values, consistent with our theoretical analysis that larger batches provide more stable codebook updates.
  • Figure 5: Qualitative reconstruction comparison. Each strip shows the same set of identities. NS-VQ and TransVQ maintain sharp edges and facial micro-structure while reducing artifacts observed in vanilla and EMA-VQ. This aligns with the quantitative gains in rFID/LPIPS/SSIM reported in the main paper.
  • ...and 2 more figures