PURE Codec: Progressive Unfolding of Residual Entropy for Speech Codec Learning
Jiatong Shi, Haoran Wang, William Chen, Chenda Li, Wangyou Zhang, Jinchuan Tian, Shinji Watanabe
TL;DR
PURE Codec tackles instability and redundancy in residual vector quantization for neural speech codecs by anchoring the first quantization stage to low-entropy, enhanced speech embeddings and progressively refining higher-entropy residuals in later stages. The approach combines enhancement-guided supervision with a two-stage training regime (VAE pretraining, followed by stochastic enhancement scheduling within a GAN framework) to stabilize optimization and improve both reconstruction and downstream SpeechLM-based generation. Across multiple datasets and under noisy training conditions, PURE improves fidelity, robustness, and intelligibility over conventional RVQ-based codecs, and ablations confirm the importance of enhancement timing and scheduling. The work targets low-bitrate speech coding for robust real-time communication and speech-driven synthesis, and outlines future work to extend the paradigm beyond speech-specific enhancement models.
Abstract
Neural speech codecs have achieved strong performance in low-bitrate compression, but residual vector quantization (RVQ) often suffers from unstable training and ineffective decomposition, limiting reconstruction quality and efficiency. We propose PURE Codec (Progressive Unfolding of Residual Entropy), a novel framework that guides multi-stage quantization using a pre-trained speech enhancement model. The first quantization stage reconstructs low-entropy, denoised speech embeddings, while subsequent stages encode the residual high-entropy components. This design substantially improves training stability. Experiments demonstrate that PURE consistently outperforms conventional RVQ-based codecs in reconstruction and in downstream text-to-speech based on speech language models, particularly under noisy training conditions.
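To make the staged decomposition concrete, the following is a minimal sketch of residual vector quantization in which the first stage is additionally scored against an enhanced (denoised) embedding. It is an illustration under stated assumptions, not the authors' implementation: the function names (`nearest`, `rvq_encode`, `anchored_losses`) are hypothetical, the codebooks are fixed toy values rather than learned ones, the enhanced embedding is passed in by hand instead of coming from a pre-trained enhancement model, and squared error is assumed for both loss terms.

```python
def nearest(codebook, vec):
    """Return (index, codeword) of the entry closest to vec in squared L2 distance."""
    def dist(i):
        return sum((c - v) ** 2 for c, v in zip(codebook[i], vec))
    idx = min(range(len(codebook)), key=dist)
    return idx, codebook[idx]


def rvq_encode(vec, codebooks):
    """Plain RVQ: each stage quantizes what the earlier stages left unexplained.

    Returns the chosen code indices and the cumulative reconstruction
    after each stage.
    """
    recon = [0.0] * len(vec)
    codes, partials = [], []
    for codebook in codebooks:
        residual = [v - r for v, r in zip(vec, recon)]
        idx, codeword = nearest(codebook, residual)
        recon = [r + c for r, c in zip(recon, codeword)]
        codes.append(idx)
        partials.append(recon)
    return codes, partials


def sq_err(a, b):
    """Squared L2 error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def anchored_losses(full_emb, enhanced_emb, codebooks):
    """Hypothetical loss split (an assumption, not taken from the paper).

    anchor_loss: stage-1 output versus the enhanced, low-entropy embedding.
    recon_loss:  sum of all stages versus the full embedding.
    """
    codes, partials = rvq_encode(full_emb, codebooks)
    anchor_loss = sq_err(partials[0], enhanced_emb)
    recon_loss = sq_err(partials[-1], full_emb)
    return codes, anchor_loss, recon_loss


# Toy usage: two stages, 2-D embeddings, hand-picked values.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],                 # stage 1: coarse codewords
    [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5]],   # stage 2: residual codewords
]
codes, anchor_loss, recon_loss = anchored_losses(
    full_emb=[1.4, 0.6], enhanced_emb=[1.0, 1.0], codebooks=toy_codebooks
)
print(codes, anchor_loss, recon_loss)
```

In a trainable codec, both terms would be minimized jointly with respect to the encoder and codebooks. The sketch only shows where an enhancement-derived target could enter a standard RVQ pipeline.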
