Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder

Haohan Guo, Fenglong Xie, Dongchao Yang, Hui Lu, Xixin Wu, Helen Meng

TL;DR

This paper tackles index collapse in large-codebook VQ-VAEs by introducing a product-quantized VAE (PQ-VAE) that composes a large codebook from multiple smaller codebooks. It couples product quantization with a dual-decoding training strategy so that every VQ subspace is used well and reconstruction quality improves. Empirical results show that PQ-VAE mitigates index collapse and, when combined with dual decoding, yields higher codebook perplexity and better speech reconstruction than other multi-codebook approaches. It also benefits LLM-based TTS, where larger codebooks support higher-quality speech generation. The findings underscore the practicality of large-codebook speech tokenizers built via PQ-VAE for scalable, high-quality speech synthesis and downstream language-model integration, as evidenced on WenetSpeech and HuBERT-based TTS pipelines. With $M$ sub-codebooks of $N_i$ codewords each, the composite codebook has size $|C^*| = \prod_{i=0}^{M-1} N_i$, enabling richer discrete representations without conventional index collapse.

Abstract

VQ-VAE, a mainstream approach to speech tokenization, has been troubled by "index collapse", where only a small number of codewords are activated in large codebooks. This work proposes a product-quantized (PQ) VAE with more codebooks but fewer codewords per codebook to address this problem and build large-codebook speech tokenizers. It encodes speech features into multiple VQ subspaces and composes them into codewords in a larger codebook. In addition, to make good use of each VQ subspace, we enhance PQ-VAE with a dual-decoding training strategy that decodes both the encoding sequence and the quantized sequence. The experimental results demonstrate that PQ-VAE addresses "index collapse" effectively, especially for larger codebooks. The model with the proposed training strategy further improves codebook perplexity and reconstruction quality, outperforming other multi-codebook VQ approaches. Finally, PQ-VAE demonstrates its effectiveness in language-model-based TTS, supporting higher-quality speech generation with larger codebooks.
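
The composition step described in the abstract can be illustrated with a minimal sketch of generic product quantization. It is not the authors' implementation: the function names, the random codebooks, and the sizes ($M = 4$ subspaces, $N_i = 16$ codewords, 2 dimensions per subspace) are all hypothetical choices made for illustration. Each sub-vector is matched to its nearest codeword in its own small codebook, and the per-subspace indices are folded into one index of the composite codebook of size $\prod_i N_i$.

```python
import math
import random

def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)))

def pq_encode(codebooks, x):
    """Quantize x with one small codebook per subspace; return per-subspace indices."""
    sub_dim = len(x) // len(codebooks)
    return [nearest(cb, x[i * sub_dim:(i + 1) * sub_dim])
            for i, cb in enumerate(codebooks)]

def composite_index(indices, sizes):
    """Fold per-subspace indices into one composite-codebook index (mixed radix)."""
    idx = 0
    for k, n in zip(indices, sizes):
        idx = idx * n + k
    return idx

random.seed(0)
M, N, SUB_DIM = 4, 16, 2  # hypothetical sizes, not taken from the paper
codebooks = [[[random.gauss(0, 1) for _ in range(SUB_DIM)] for _ in range(N)]
             for _ in range(M)]
sizes = [len(cb) for cb in codebooks]
x = [random.gauss(0, 1) for _ in range(M * SUB_DIM)]  # stand-in for one encoder frame

indices = pq_encode(codebooks, x)
total = math.prod(sizes)                  # composite size: 16 ** 4 = 65536
token = composite_index(indices, sizes)   # a single token in range(total)
print(indices, token, total)
```

In this sketch only $4 \times 16 = 64$ codewords are stored, yet they address 65,536 composite tokens, which is the property the abstract relies on when it speaks of "more codebooks but fewer codewords".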

Paper Structure (18 sections, 4 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: The framework of PQ-VAE with dual-decoding training strategy.
  • Figure 2: The codebook usage, codebook perplexity, and reconstruction loss (RMSE) of VQ-VAE and PQ-VAE under different total codebook sizes (the horizontal axis).
  • Figure 3: The codebook usage and perplexity on a $\log_2$ scale, and reconstruction loss (RMSE) of PQ-VAE with or without the proposed dual-decoding training strategy under different total codebook sizes (the horizontal axis).
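
The two codebook metrics named in the figure captions can be sketched as follows. This page does not give the paper's exact formulas, so the definitions here are assumptions: usage is taken as the fraction of codewords that appear at least once, and perplexity as $2^H$, where $H$ is the entropy in bits of the empirical index distribution. The function name and the toy index sequences are hypothetical.

```python
import math
from collections import Counter

def codebook_stats(indices, codebook_size):
    """Return (usage, perplexity, entropy_bits) for a sequence of codeword indices.

    Assumed definitions (not confirmed by the source):
      usage      = distinct indices observed / codebook_size
      perplexity = 2 ** entropy of the empirical index distribution
    """
    counts = Counter(indices)
    total = len(indices)
    usage = len(counts) / codebook_size
    entropy_bits = -sum((c / total) * math.log2(c / total)
                        for c in counts.values())
    return usage, 2 ** entropy_bits, entropy_bits

SIZE = 1024  # hypothetical codebook size

# Collapsed tokenizer: almost every frame maps to the same two codewords.
collapsed = [0] * 90 + [1] * 10
# Balanced tokenizer: every codeword is used equally often.
balanced = list(range(SIZE))

for name, seq in (("collapsed", collapsed), ("balanced", balanced)):
    usage, ppl, bits = codebook_stats(seq, SIZE)
    print(f"{name}: usage={usage:.4f} perplexity={ppl:.2f} log2(perplexity)={bits:.2f}")
```

Under these assumed definitions, a perfectly balanced codebook reaches a perplexity equal to its size, while a collapsed one stays close to 1 regardless of how large the codebook is.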