Representation Collapsing Problems in Vector Quantization

Wenhao Zhao, Qiran Zou, Rushi Shah, Dianbo Liu

TL;DR

This study investigates representation collapse in vector quantization - a critical degradation where codebook tokens or latent embeddings lose their discriminative power by converging to a limited subset of values.

Abstract

Vector quantization is a machine learning technique that discretizes continuous representations into a set of discrete vectors. It is widely used to tokenize data representations for large language models, diffusion models, and other generative models. Despite its prevalence, the characteristics and behaviors of vector quantization in generative models remain largely underexplored. In this study, we investigate representation collapse in vector quantization - a critical degradation where codebook tokens or latent embeddings lose their discriminative power by converging to a limited subset of values. This collapse fundamentally compromises the model's ability to capture diverse data patterns. Using both synthetic and real datasets, we identify the severity of each type of collapse and its triggering conditions. Our analysis reveals that restricted initialization and limited encoder capacity result in tokens collapse and embeddings collapse, respectively. Building on these findings, we propose potential solutions aimed at mitigating each collapse. To the best of our knowledge, this is the first comprehensive study examining representation collapsing problems in vector quantization.
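To make the discretization step concrete, here is a minimal sketch of nearest-neighbour quantization in plain Python. It is an illustration only, not the authors' implementation: the function name `quantize`, the squared Euclidean distance, and the toy codebook and latents are all assumptions chosen for the example.

```python
def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codebook entry.

    Distance is squared Euclidean (an assumption; a common choice for VQ).
    """
    indices = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices


# Hypothetical toy data: three codes, four 2-D latent vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
latents = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1], [1.1, 0.8]]

indices = quantize(latents, codebook)
quantized = [codebook[i] for i in indices]  # discrete stand-ins for the latents
unused = sorted(set(range(len(codebook))) - set(indices))
print(indices, unused)
```

In this toy setup the latents occupy a small region, so the code at `[5.0, 5.0]` is never selected. Codes that sit far from where the latents actually fall go unused, which reduces the number of distinct tokens in practice.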

Paper Structure

This paper contains 14 sections, 5 equations, 6 figures.

Figures (6)

  • Figure 1: Representation collapse types in vector quantization. On the left, Tokens Collapse is illustrated: a subset of tokens (shown in red) collapses, leaving fewer codes for other peaks and losing diversity compared to normal embeddings (in grey). On the right, Embeddings Collapse is shown: a large portion of the embedding space (in green) collapses into a limited set of representations, losing important information present in other modes. Both phenomena degrade the quality of learned representations.
  • Figure 2: Distribution of the untrained and trained encoders' outputs. (a) The untrained encoder's output has fewer peaks than the 10 peaks of the input and clusters within a relatively small range. (b) The trained encoder's output displays 10 peaks, the same as the input.
  • Figure 3: Tokens collapse and results of our pretraining solution on synthetic data. Comparing results with and without our pretraining solution shows that an untrained encoder can cause tokens collapse and that our pretraining solution is effective.
  • Figure 4: Validation of tokens collapse on CIFAR-10. As the total number of tokens increases, the MSE and perplexity of the pretrained and original VQ-VAE models behave differently. From $2^{12}$ tokens onward, the original VQ-VAE starts to suffer from severe tokens collapse due to dense tokens, causing MSE and perplexity to stagnate. Conversely, the pretrained VQ-VAE addresses this issue, resulting in continually decreasing MSE and increasing perplexity.
  • Figure 5: Embeddings collapse problem on synthetic data. Compared to the encoder with high capacity (hidden size 32), the encoder with low capacity (hidden size 4) exhibits embeddings collapse.
  • ...and 1 more figure
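The Figure 4 caption uses perplexity as an indicator of how many tokens are actually in use. This excerpt does not give the paper's exact definition, so the sketch below assumes the common one: the exponential of the Shannon entropy of the empirical code-usage distribution. The function name `codebook_perplexity` and the sample index lists are hypothetical.

```python
import math
from collections import Counter


def codebook_perplexity(indices):
    """exp(entropy) of code usage; assumed definition, not taken from the paper.

    Equals the number of codes when usage is uniform, and 1 when every
    input maps to the same code.
    """
    total = len(indices)
    probs = [count / total for count in Counter(indices).values()]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Hypothetical assignments: four codes used evenly vs. a single code used always.
uniform_usage = [0, 1, 2, 3] * 25
collapsed_usage = [0] * 100

print(codebook_perplexity(uniform_usage))
print(codebook_perplexity(collapsed_usage))
```

Under this definition, a perplexity that stops growing as the codebook gets larger means the added tokens are not being selected, which matches the stagnation the caption describes for the original VQ-VAE.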