Addressing Representation Collapse in Vector Quantized Models with One Linear Layer

Yongxin Zhu, Bocheng Li, Yifei Xin, Zhihua Xia, Linli Xu

TL;DR

This work identifies representation collapse in vector quantized models as stemming from disjoint codebook optimization, where only a subset of code vectors is updated during training. It proposes SimVQ, a simple method that reparameterizes the codebook through a learnable latent linear transformation, allowing joint optimization of the latent space and preventing collapse without reducing model capacity. Across image and audio experiments, SimVQ achieves near-complete codebook utilization, superior reconstruction metrics, and robust scalability to very large codebooks, while offering favorable memory efficiency. The results suggest that updating the latent space rather than individual codes is a key principle for stabilizing VQ representations and enabling scalable discrete tokenization for multimodal learning.

Abstract

Vector Quantization (VQ) is essential for discretizing continuous representations in unsupervised learning but suffers from representation collapse, causing low codebook utilization and limiting scalability. Existing solutions often rely on complex optimizations or reduce latent dimensionality, which compromises model capacity and fails to fully solve the problem. We identify the root cause as disjoint codebook optimization, where only a few code vectors are updated via gradient descent. To fix this, we propose SimpleVQ (SimVQ), which reparameterizes code vectors through a learnable linear transformation layer over a latent basis, optimizing the entire linear space rather than the nearest individual code vectors. Although the multiplication of two linear matrices is equivalent to applying a single linear layer, this simple approach effectively prevents collapse. Extensive experiments on image and audio tasks demonstrate that SimVQ improves codebook usage, is easy to implement, and generalizes well across modalities and architectures. The code is available at https://github.com/youngsheen/SimVQ.
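The contrast between disjoint and joint codebook optimization can be illustrated with a small NumPy sketch. It is not the authors' implementation: the sizes `K`, `d`, `n`, the learning rate, the identity initialization of `W`, and the random stand-in for encoder features are all assumptions made for illustration, and the commitment loss and straight-through estimator used in full VQ training are omitted. The sketch takes one gradient step in each setting and counts how many code vectors actually move.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 64, 8, 16   # codebook size, latent dim, batch size (toy values, assumed)
lr = 0.1              # step size (assumed)

def nearest(z, codebook):
    """Index of the nearest code (squared Euclidean distance) for each row of z."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

z = rng.normal(size=(n, d))   # stand-in for encoder outputs
C = rng.normal(size=(K, d))   # code vectors (vanilla) / frozen latent basis (SimVQ-style)
W = np.eye(d)                 # learnable linear layer, identity init (assumption)

# Vanilla VQ: the gradient of sum_i ||z_i - C[idx_i]||^2 with respect to C
# is non-zero only on the rows that were selected as nearest codes.
idx = nearest(z, C)
grad_C = np.zeros_like(C)
np.add.at(grad_C, idx, 2.0 * (C[idx] - z))
C_vanilla = C - lr * grad_C
moved_vanilla = int((np.abs(C_vanilla - C).sum(axis=1) > 0).sum())

# SimVQ-style: effective codebook is C @ W with C frozen. The gradient flows
# into W, so a single step changes every row of C @ W.
codebook = C @ W
idx = nearest(z, codebook)
resid = codebook[idx] - z                 # shape (n, d)
grad_W = 2.0 * C[idx].T @ resid / n       # shape (d, d)
W_new = W - lr * grad_W
codebook_new = C @ W_new
moved_simvq = int((np.abs(codebook_new - codebook).sum(axis=1) > 0).sum())

print(f"codes moved, vanilla VQ:  {moved_vanilla} / {K}")
print(f"codes moved, SimVQ-style: {moved_simvq} / {K}")
```

In the vanilla step at most `n` codes can move, because only selected rows receive gradient. In the reparameterized step every code shares `W`, so all `K` codes shift together even though the same few codes were selected.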

Paper Structure

This paper contains 32 sections, 16 equations, 10 figures, 5 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison of vanilla VQ and SimVQ. (a, left) Disjoint optimization in vanilla VQ: only the nearest codes are updated, leaving a high percentage of "dead" codes that are never updated. (b, right) Joint optimization in SimVQ: the entire codebook is updated through a latent basis, ensuring all codes remain active.
  • Figure 2: (a, left) Optimization trajectory of the objective $\|\bm{x}-\bm{q}\|^2_2$, which corresponds to vanilla VQ. Only a small fraction of points are updated while the others remain inactive. (b, right) Optimization trajectory of the objective $\|\bm{x}-\bm{q}\bm{w}\|^2_2$ with $\bm{q}$ frozen, which corresponds to SimVQ. All points are updated towards the targets $\bm{x}$.
  • Figure 3: (a, left) Optimization trajectory of the objective $\|\bm{x}-\bm{q}\bm{w}\|^2_2$ with both $\bm{q}$ and $\bm{w}$ unfrozen. (b, right) Frobenius norm of the projection matrix $\bm{w}$ and loss curves. The loss quickly converges to 0 with $\bm{w}$ almost unchanged.
  • Figure 4: (a, left) Rank of the latent basis matrix $\bm{W}$ over training epochs. (b, right) Frobenius norm of the latent basis matrix $\bm{W}$ over training epochs.
  • Figure 5: Visualization of the divergence between encoder features and codebook embeddings on a random subset of the ImageNet validation set. Left: vanilla VQ. Right: SimVQ.
  • ...and 5 more figures
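
The quantities named in the captions of Figures 2(b) and 4 (the objective $\|\bm{x}-\bm{q}\bm{w}\|^2_2$ with $\bm{q}$ frozen, and the rank and Frobenius norm of the linear map) can be tracked in a toy setting as follows. This is a hedged sketch on synthetic 2-D data, not a reproduction of the paper's experiments: the sizes, the offset of the targets, the learning rate, the step count, and the identity initialization are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, n = 32, 2, 256                    # toy sizes (assumed)
lr, steps = 0.02, 300                   # step size and iteration count (assumed)

x = rng.normal(loc=3.0, size=(n, d))    # targets, deliberately offset from the codes
q = rng.normal(size=(K, d))             # frozen code vectors
w = np.eye(d)                           # learnable linear map, identity init (assumption)

history = []                            # (loss, rank of w, Frobenius norm of w) per step
for _ in range(steps):
    codes = q @ w
    # nearest-code assignment under the current linear map
    idx = ((x[:, None, :] - codes[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    resid = codes[idx] - x
    loss = float((resid ** 2).sum(axis=1).mean())
    history.append((loss, int(np.linalg.matrix_rank(w)), float(np.linalg.norm(w, "fro"))))
    # gradient of the mean of ||x_i - q[idx_i] w||^2 with respect to w; q stays frozen
    w = w - lr * 2.0 * q[idx].T @ resid / n

first, last = history[0], history[-1]
print(f"loss: {first[0]:.3f} -> {last[0]:.3f}")
print(f"rank of w: {first[1]} -> {last[1]}")
print(f"Frobenius norm of w: {first[2]:.3f} -> {last[2]:.3f}")
```

Because only `w` is trained, every row of `q @ w` moves at each step, which is the behavior the Figure 2(b) caption describes. The rank and norm columns are the same diagnostics Figure 4 reports for $\bm{W}$, here computed on the toy map.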

Theorems & Definitions (1)

  • Remark 1