PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning

Jiatong Shi, Haoran Wang, William Chen, Chenda Li, Wangyou Zhang, Jinchuan Tian, Shinji Watanabe

TL;DR

PURE Codec tackles instability and redundancy in residual vector quantization for neural speech codecs by anchoring the first quantization stage to low-entropy, enhanced speech embeddings and progressively refining higher-entropy residuals. The approach combines enhancement-guided supervision with a two-stage training regime (VAE pretraining and stochastic enhancement scheduling within a GAN framework) to stabilize optimization and improve both reconstruction and downstream SpeechLM-based generation. Across multiple datasets and noisier training conditions, PURE delivers superior fidelity, robustness, and intelligibility gains, with ablations validating the importance of enhancement timing and scheduling. The work advances low-bitrate speech coding with practical impact for robust real-time communication and speech-driven synthesis, while outlining future work to extend the paradigm beyond speech-specific enhancement models.

Abstract

Neural speech codecs have achieved strong performance in low-bitrate compression, but residual vector quantization (RVQ) often suffers from unstable training and ineffective decomposition, limiting reconstruction quality and efficiency. We propose PURE Codec (Progressive Unfolding of Residual Entropy), a novel framework that guides multi-stage quantization using a pre-trained speech enhancement model. The first quantization stage reconstructs low-entropy, denoised speech embeddings, while subsequent stages encode residual high-entropy components. This design improves training stability significantly. Experiments demonstrate that PURE consistently outperforms conventional RVQ-based codecs in reconstruction and downstream speech language model-based text-to-speech, particularly under noisy training conditions.
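The staged decomposition described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`nearest_code`, `rvq_encode`, `anchor_loss`), the nearest-neighbour codebook lookup, and the use of an L1 distance between the first-stage output and an enhanced embedding `z_enh` are all assumptions made for illustration. It shows only the general idea that stage 1 is supervised toward the enhanced (low-entropy) embedding, while later stages quantize whatever residual remains.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return the closest codebook vector (and its index) for each row of x.

    x: (T, D) frame embeddings, codebook: (K, D) code vectors.
    """
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def rvq_encode(z, codebooks):
    """Plain residual vector quantization: each stage quantizes the residual
    left by the previous stages. Returns the summed quantized embedding,
    the per-stage outputs, and the per-stage code indices."""
    residual = z.copy()
    quantized = np.zeros_like(z)
    stage_outputs, indices = [], []
    for codebook in codebooks:
        q, idx = nearest_code(residual, codebook)
        quantized = quantized + q
        residual = residual - q
        stage_outputs.append(q)
        indices.append(idx)
    return quantized, stage_outputs, indices

def anchor_loss(first_stage_output, z_enh):
    """Hypothetical anchoring term (assumed L1): pulls the first stage toward
    the embedding of the enhanced signal rather than the raw noisy one."""
    return np.abs(first_stage_output - z_enh).mean()

# Toy usage with random data; shapes and sizes are arbitrary.
rng = np.random.default_rng(0)
z = rng.normal(size=(5, 4))                      # encoder output for noisy speech
z_enh = z - 0.1 * rng.normal(size=(5, 4))        # stand-in for an enhanced embedding
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
quantized, stage_outputs, indices = rvq_encode(z, codebooks)
loss = anchor_loss(stage_outputs[0], z_enh)
```

In an actual training setup the codebooks and encoder would be learned and this term would be combined with the reconstruction and adversarial losses. The sketch only fixes the data flow between stages.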

Paper Structure

This paper contains 17 sections, 13 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: PURE Codec framework. The input waveform $S$ is optionally enhanced to produce $S^{\text{enh}}$, guiding the first quantization stage via low-entropy embeddings. The encoder output is refined through multiple quantization streams and decoded to reconstruct $\hat{S}$. L1 losses and adversarial losses from discriminators supervise training. Refer to the PURE Codec section for details.