Robust Training of Vector Quantized Bottleneck Models

Adrian Łańcucki, Jan Chorowski, Guillaume Sanchez, Ricard Marxer, Nanxin Chen, Hans J. G. A. Dolfing, Sameer Khurana, Tanel Alumäe, Antoine Laurent

TL;DR

This paper focuses on VQ-VAE, a state-of-the-art discrete bottleneck model shown to perform on par with its continuous counterparts. It shows that codebook learning can suffer from poor initialization and non-stationarity of the clustered encoder outputs, and that both problems can be overcome by increasing the learning rate for the codebook and by periodic data-dependent codeword re-initialization.

Abstract

In this paper we demonstrate methods for reliable and efficient training of discrete representations using Vector-Quantized Variational Auto-Encoder models (VQ-VAEs). Discrete latent variable models have been shown to learn nontrivial representations of speech, applicable to unsupervised voice conversion and reaching state-of-the-art performance on unit discovery tasks. For unsupervised representation learning, they became viable alternatives to continuous latent variable models such as the Variational Auto-Encoder (VAE). However, training deep discrete variable models is challenging, due to the inherent non-differentiability of the discretization operation. In this paper we focus on VQ-VAE, a state-of-the-art discrete bottleneck model shown to perform on par with its continuous counterparts. It quantizes encoder outputs with on-line $k$-means clustering. We show that the codebook learning can suffer from poor initialization and non-stationarity of clustered encoder outputs. We demonstrate that these problems can be successfully overcome by increasing the learning rate for the codebook and by periodic data-dependent codeword re-initialization. As a result, we achieve more robust training across different tasks, and significantly increase the usage of latent codewords even for large codebooks. This has practical benefit, for instance, in unsupervised representation learning, where large codebooks may lead to disentanglement of latent representations.
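The two mechanisms named in the abstract, nearest-codeword quantization and data-dependent re-initialization, can be illustrated with a minimal NumPy sketch. This is not the paper's procedure: the helper names `quantize` and `reinit_unused`, the codebook size, and the choice to re-seed only unused codewords with randomly drawn encoder outputs are all assumptions made for illustration, and random vectors stand in for real encoder outputs.

```python
import numpy as np

def quantize(z, codebook):
    """Return, for each row of z, the index of its nearest codeword."""
    # Squared Euclidean distances, shape (num_vectors, num_codewords).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def reinit_unused(codebook, z, rng):
    """Move codewords that no row of z maps to onto randomly chosen rows of z.

    Hypothetical simplification of data-dependent re-initialization.
    """
    idx = quantize(z, codebook)
    used = np.zeros(len(codebook), dtype=bool)
    used[idx] = True
    new_codebook = codebook.copy()
    n_dead = int((~used).sum())
    if n_dead > 0:
        # Sample distinct rows when possible; fall back to replacement otherwise.
        picks = rng.choice(len(z), size=n_dead, replace=n_dead > len(z))
        new_codebook[~used] = z[picks]
    return new_codebook

rng = np.random.default_rng(0)
# Stand-in encoder outputs on a much smaller scale than the initial codebook.
z = rng.normal(scale=0.1, size=(256, 8))
codebook = rng.normal(scale=1.0, size=(32, 8))

usage_before = len(np.unique(quantize(z, codebook)))
codebook = reinit_unused(codebook, z, rng)
usage_after = len(np.unique(quantize(z, codebook)))
print(f"codewords in use: {usage_before} -> {usage_after} of {len(codebook)}")
```

Every re-seeded codeword coincides with an encoder output, so it is guaranteed to be selected at least once on the same batch. Whether previously used codewords keep their assignments depends on the data.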

Paper Structure

This paper contains 12 sections, 1 theorem, 9 equations, 4 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

The EMA update rule (eq:ema2) with constant usage counts $N_i=1$ is equivalent to an SGD update for ordinary loss (eq:loss) with a rescaling learning rate $\alpha = (1-\gamma)/2$.
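The proposition can be checked numerically in the simplest setting. The sketch below assumes forms for the two referenced equations, which are not reproduced on this page: an EMA update $w \leftarrow \gamma w + (1-\gamma) z$ for a single encoder output $z$ assigned to codeword $w$ (the $N_i = 1$ case), and a codebook loss $\|z - w\|^2$. The function names are hypothetical.

```python
import numpy as np

def ema_update(w, z, gamma):
    """Assumed EMA rule with usage count fixed to 1: w <- gamma*w + (1-gamma)*z."""
    return gamma * w + (1.0 - gamma) * z

def sgd_update(w, z, alpha):
    """One SGD step on the assumed codebook loss ||z - w||^2 with respect to w."""
    grad = 2.0 * (w - z)  # gradient of ||z - w||^2 w.r.t. w
    return w - alpha * grad

rng = np.random.default_rng(0)
w = rng.normal(size=4)   # a codeword
z = rng.normal(size=4)   # an encoder output assigned to it

gamma = 0.99
alpha = (1.0 - gamma) / 2.0  # learning rate stated in the proposition

w_ema = ema_update(w, z, gamma)
w_sgd = sgd_update(w, z, alpha)
print(np.allclose(w_ema, w_sgd))
```

Under these assumptions the SGD step expands to $w - (1-\gamma)(w - z) = \gamma w + (1-\gamma) z$, which is the EMA step, so the two updates agree up to floating-point error.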

Figures (4)

  • Figure 1: The impact of scale of encoder outputs relative to the scale of codebook words shown in 3-D. (a) Relative scale of codewords $w$ and encoder outputs $e(x)$ impacts performance of mapping bottleneck features to symbols on a subset of ScribbleLens (see Section \ref{sec:scribble} for details). (b) If the encoder outputs are larger, multiple codewords are likely to be used. (c) If the encoder outputs are smaller, they tend to cluster and map to fewer codewords.
  • Figure 2: Wall Street Journal Dev93 supervised phoneme error rate (PER)
  • Figure 3: A sample training line from the ScribbleLens corpus [Scribble20Dolfing20]
  • Figure 4: Unsupervised bits/dim (BPD) on CIFAR-10 test set for $8\times128$-codeword model during training with checkpoint averaging. During initial data-dependent reestimation iterations, BPD is higher for checkpoint-averaged models, because each codebook re-initialization breaks the averages. However, when the reestimation period is over, these models converge faster than others.

Theorems & Definitions (2)

  • Proposition 1
  • proof