NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization
Zhikang Niu, Sanyuan Chen, Long Zhou, Ziyang Ma, Xie Chen, Shujie Liu
TL;DR
NDVQ tackles noise sensitivity and codebook collapse in neural audio codecs by representing each VQ code as a normal distribution with a learnable mean and variance, so that the learned variance provides a margin between codes and quantization selects the most likely distribution. The encoder–decoder framework combines a residual normal distribution vector quantizer, a multi-scale STFT discriminator, and a training loss with distribution-related terms and a reparameterization-based sampling mechanism, and it remains robust at very low bitrates. On LibriTTS and on out-of-domain data, NDVQ outperforms EnCodec on perceptual and distortion metrics and improves zero-shot TTS, suggesting better generalization and potential as a universal codec for speech. The work highlights the benefits of distribution-based VQ for robustness, codebook utilization, and downstream codec-based synthesis; future work targets broader audio domains and universality.
Abstract
Built upon vector quantization (VQ), discrete audio codec models have achieved great success in audio compression and auto-regressive audio generation. However, existing models face substantial challenges in perceptual quality and signal distortion, especially when operating at extremely low bandwidths, a weakness rooted in the sensitivity of the VQ codebook to noise. This degradation poses significant challenges for several downstream tasks, such as codec-based speech synthesis. To address this issue, we propose a novel VQ method, Normal Distribution-based Vector Quantization (NDVQ), which introduces an explicit margin between the VQ codes by learning a variance. Specifically, our approach maps the waveform to a latent space and quantizes it by selecting the most likely normal distribution, with each codebook entry representing a unique normal distribution defined by its mean and variance. From these distribution-based codec codes, a decoder reconstructs the input waveform. NDVQ is trained with additional distribution-related losses, alongside reconstruction and discriminator losses. Experiments demonstrate that NDVQ outperforms existing audio compression baselines, such as EnCodec, in terms of audio quality and zero-shot TTS, particularly in very low bandwidth scenarios.
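The quantization step described above (pick the codebook entry whose normal distribution makes the latent most likely) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the diagonal covariance, the log-variance parameterization, the function names, and the choice to return the mean when no random generator is passed are all assumptions made for illustration. The optional sampling branch shows the usual reparameterization form, mean + std * noise.

```python
import numpy as np


def gaussian_log_likelihood(z, means, log_vars):
    """Log density of every latent under every code.

    Assumes each code is a diagonal Gaussian (an illustrative choice).
    z: (N, D) latents; means, log_vars: (K, D) codebook. Returns (N, K).
    """
    diff = z[:, None, :] - means[None, :, :]            # (N, K, D)
    inv_var = np.exp(-log_vars)[None, :, :]             # (1, K, D)
    per_dim = diff ** 2 * inv_var + log_vars[None, :, :] + np.log(2.0 * np.pi)
    return -0.5 * per_dim.sum(axis=-1)


def ndvq_quantize(z, means, log_vars, rng=None):
    """Select the most likely code for each latent.

    Without rng, returns the selected mean (hypothetical inference mode).
    With rng, draws a reparameterized sample: mean + std * eps.
    """
    idx = np.argmax(gaussian_log_likelihood(z, means, log_vars), axis=1)
    mu = means[idx]
    if rng is None:
        return idx, mu
    std = np.exp(0.5 * log_vars[idx])
    eps = rng.standard_normal(mu.shape)                 # noise independent of parameters
    return idx, mu + std * eps


if __name__ == "__main__":
    # Toy codebook: code 0 is narrow (variance 0.25), code 1 is wide (variance 4).
    means = np.array([[0.0, 0.0], [4.0, 0.0]])
    log_vars = np.log(np.array([[0.25, 0.25], [4.0, 4.0]]))
    z = np.array([[1.8, 0.0], [0.1, 0.0]])

    idx, quantized = ndvq_quantize(z, means, log_vars)
    print(idx)
    print(quantized)
```

In this toy codebook the first latent, (1.8, 0), is closer in Euclidean distance to the mean of code 0, yet it is assigned to code 1, because the wide distribution explains it better than the narrow one. That is the practical difference between likelihood-based selection and the nearest-neighbor lookup of a conventional VQ codebook.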
