Preventing Local Pitfalls in Vector Quantization via Optimal Transport

Borui Zhang, Wenzhao Zheng, Jie Zhou, Jiwen Lu

TL;DR

The paper tackles training instability in vector-quantized networks by identifying local minima in code assignment as the root cause of index collapse. It reframes quantization as an optimal-transport problem between data features ${\bm{Z}}$ and codebook vectors ${\bm{C}}$, solved efficiently with the Sinkhorn-Knopp algorithm under an entropy-regularized objective that includes the entropy term $H({\bm{A}})$ and a balance parameter $\epsilon$. To keep the Sinkhorn step robust across diverse data distributions, the distance matrix is normalized, and a multi-head quantizer expands the effective codebook size to $n^B$, enabling richer discretization. Empirically, OptVQ achieves 100% codebook utilization, outperforms state-of-the-art VQNs on image reconstruction (e.g., ImageNet, MNIST, CIFAR-10), and trains stably without relying on subtle initialization or model distillation. The work highlights the practical value of global-structure-aware quantization for scalable, stable representation learning in generative and discriminative vision models.

Abstract

Vector-quantized networks (VQNs) have exhibited remarkable performance across various tasks, yet they are prone to training instability, which complicates the training process due to the necessity for techniques such as subtle initialization and model distillation. In this study, we identify the local minima issue as the primary cause of this instability. To address this, we integrate an optimal transport method in place of the nearest neighbor search to achieve a more globally informed assignment. We introduce OptVQ, a novel vector quantization method that employs the Sinkhorn algorithm to optimize the optimal transport problem, thereby enhancing the stability and efficiency of the training process. To mitigate the influence of diverse data distributions on the Sinkhorn algorithm, we implement a straightforward yet effective normalization strategy. Our comprehensive experiments on image reconstruction tasks demonstrate that OptVQ achieves 100% codebook utilization and surpasses current state-of-the-art VQNs in reconstruction quality.
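
The idea of replacing nearest neighbor search with a Sinkhorn-based assignment can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the uniform marginals, the standardization used as "normalization", and the convention that the kernel is exp(-D / eps) are all assumptions made for illustration; the summary above does not specify them.

```python
import numpy as np

def pairwise_sq_dist(z, c):
    # Squared Euclidean distance between every feature (row of z)
    # and every code (row of c); result has shape (m, n).
    return ((z[:, None, :] - c[None, :, :]) ** 2).sum(-1)

def sinkhorn_assign(dist, eps=0.1, n_iters=100):
    # Assumed normalization: standardize the distance matrix so the
    # kernel below is insensitive to the scale of the inputs.
    d = (dist - dist.mean()) / (dist.std() + 1e-8)
    k = np.exp(-d / eps)              # Gibbs kernel (assumed convention)
    m, n = k.shape
    row = np.full(m, 1.0 / m)         # every feature carries equal mass
    col = np.full(n, 1.0 / n)         # every code should receive equal mass
    u = np.ones(m)
    for _ in range(n_iters):
        v = col / (k.T @ u)           # rescale columns toward `col`
        u = row / (k @ v)             # rescale rows toward `row`
    plan = u[:, None] * k * v[None, :]
    # Hard index per feature: the code that receives most of its mass.
    return plan.argmax(axis=1), plan

rng = np.random.default_rng(0)
features = rng.normal(size=(256, 8))
codebook = rng.normal(size=(16, 8)) + 3.0   # deliberately poor initialization

dist = pairwise_sq_dist(features, codebook)
nn_idx = dist.argmin(axis=1)                # greedy nearest neighbor search
ot_idx, plan = sinkhorn_assign(dist)        # transport-based assignment

print("codes used by nearest neighbor:", np.unique(nn_idx).size, "of", codebook.shape[0])
print("codes used by Sinkhorn plan:   ", np.unique(ot_idx).size, "of", codebook.shape[0])
```

The column constraint is what distinguishes this from the greedy search: each code is pushed to receive an equal share of the total mass, so a badly initialized code cannot be ignored simply because another code is slightly closer to every feature.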

Paper Structure

This paper contains 35 sections, 12 equations, 12 figures, 4 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison between different VQ methods. (a) Vanilla VQ employs the nearest neighbor search for quantization, which is a greedy quantization strategy. (b) OptVQ considers vector quantization as an optimal transport problem, which utilizes global information between data for quantization. (c) OptVQ achieves 100% codebook utilization and outperforms other counterparts in image reconstruction tasks.
  • Figure 2: Optimization process of different quantization methods. (a) Vanilla VQ is significantly impacted by initialization, and features are trapped in the Voronoi cell $\Omega_i$. (b) The proposed OptVQ can escape local dilemmas and achieve global-aware indexing.
  • Figure 3: Training details of OptVQ. (a) The iterative Sinkhorn algorithm efficiently resolves the optimal transport problem. (b) The original Sinkhorn method exhibits sensitivity to input values, necessitating normalization. (c) The multi-head mechanism significantly amplifies the effective number of codebooks.
  • Figure 4: Visualizations of reconstruction results on ImageNet validation set (detailed comparison marked in red boxes).
  • Figure 5: Visualizations on MNIST and CIFAR-10.
  • ...and 7 more figures
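
The multi-head mechanism in Figure 3(c) can be sketched as follows. This is a hedged illustration rather than the paper's code: it assumes the feature vector is split evenly along the channel dimension into B heads, that each head has its own codebook of n codes, and it uses plain nearest neighbor search per head for brevity. The function name `multi_head_quantize` is hypothetical. Under these assumptions, the tuple of B per-head indices can take n^B distinct values, which is the effective codebook size.

```python
import numpy as np

def multi_head_quantize(z, codebooks):
    # z: (m, d) features; codebooks: (B, n, d // B), one codebook per head
    # (separate per-head codebooks are an assumption of this sketch).
    num_heads, n, d_head = codebooks.shape
    m = z.shape[0]
    heads = z.reshape(m, num_heads, d_head)
    idx = np.empty((m, num_heads), dtype=np.int64)
    for b in range(num_heads):
        # Squared distances between head b of every feature and its n codes.
        diff = heads[:, b, None, :] - codebooks[b][None, :, :]
        idx[:, b] = (diff ** 2).sum(-1).argmin(axis=1)
    # Gather the selected code of every head and concatenate back to d dims.
    quantized = codebooks[np.arange(num_heads)[None, :], idx].reshape(m, -1)
    # Encode the B per-head indices as one base-n integer in [0, n**B).
    composite = (idx * (n ** np.arange(num_heads))).sum(axis=1)
    return quantized, idx, composite

rng = np.random.default_rng(0)
n, B, d = 4, 3, 6
z = rng.normal(size=(10, d))
codebooks = rng.normal(size=(B, n, d // B))

quantized, idx, composite = multi_head_quantize(z, codebooks)
print("stored codes:", B * n, "| distinct composite indices possible:", n ** B)
```

With n = 4 and B = 3, only 12 code vectors are stored, yet the composite index ranges over 64 values, which is why the effective codebook size grows exponentially in the number of heads.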