NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization

Zhikang Niu, Sanyuan Chen, Long Zhou, Ziyang Ma, Xie Chen, Shujie Liu

TL;DR

NDVQ tackles noise sensitivity and codebook collapse in neural audio codecs by representing each VQ code as a normal distribution with a learnable mean and variance, which introduces an explicit margin between codes and makes quantization distribution-based. The encoder–decoder framework combines a residual normal distribution vector quantizer with a multi-scale STFT discriminator, and is trained with distribution-related losses alongside reconstruction and discrimination losses, using a reparameterization-based sampling mechanism. On LibriTTS and out-of-domain data, NDVQ outperforms EnCodec on perceptual and distortion metrics, particularly at very low bitrates, and improves zero-shot TTS, suggesting better generalization and potential as a universal audio codec for speech. The work highlights the benefits of distribution-based VQ for robustness, codebook utilization, and downstream codec-based synthesis; future work targets broader audio domains and universality.

Abstract

Built upon vector quantization (VQ), discrete audio codec models have achieved great success in audio compression and auto-regressive audio generation. However, existing models face substantial challenges in perceptual quality and signal distortion, especially when operating at extremely low bandwidth, a weakness rooted in the sensitivity of the VQ codebook to noise. This degradation poses significant challenges for several downstream tasks, such as codec-based speech synthesis. To address this issue, we propose a novel VQ method, Normal Distribution-based Vector Quantization (NDVQ), which introduces an explicit margin between the VQ codes by learning a variance. Specifically, our approach maps the waveform to a latent space and quantizes it by selecting the most likely normal distribution, with each codebook entry representing a unique normal distribution defined by its mean and variance. A decoder then reconstructs the input waveform from these distribution-based VQ codec codes. NDVQ is trained with additional distribution-related losses, alongside reconstruction and discrimination losses. Experiments demonstrate that NDVQ outperforms existing audio compression baselines, such as EnCodec, in terms of audio quality and zero-shot TTS, particularly in very low bandwidth scenarios.
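
To make "selecting the most likely normal distribution" concrete, here is a minimal NumPy sketch. It assumes diagonal Gaussians stored as means and log-variances; the function name `select_most_likely_code`, the log-variance parameterization, and the toy numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def select_most_likely_code(z, means, log_vars):
    """Return the index of the diagonal Gaussian under which z is most likely.

    z:        (D,)   latent vector produced by the encoder
    means:    (K, D) codebook means
    log_vars: (K, D) codebook log-variances (assumed parameterization)
    """
    var = np.exp(log_vars)
    # Log-density of a diagonal Gaussian; the constant -D/2 * log(2*pi)
    # is the same for every code and is dropped.
    log_lik = -0.5 * np.sum(log_vars + (z - means) ** 2 / var, axis=1)
    return int(np.argmax(log_lik))

# Toy codebook with two 2-D codes.
means = np.array([[0.0, 0.0], [2.0, 0.0]])
z = np.array([0.9, 0.0])  # slightly closer to the mean of code 0

# With equal variances the rule reduces to nearest-mean quantization.
equal_var_choice = select_most_likely_code(z, means, np.zeros((2, 2)))

# If code 0 has a very small variance, z falls outside its high-density
# region and the broader code 1 becomes the more likely choice.
tight_log_vars = np.array([[np.log(0.01)] * 2, [0.0, 0.0]])
tight_var_choice = select_most_likely_code(z, means, tight_log_vars)

print(equal_var_choice, tight_var_choice)
```

The second call shows why a learned variance matters: the decision boundary between codes is no longer fixed at the midpoint between their means, so each code covers a region whose size it learns.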

Paper Structure (18 sections, 9 equations, 1 figure, 6 tables, 1 algorithm)

Figures (1)

  • Figure 1: NDVQ: an encoder-decoder based robust neural audio codec that utilizes normal distribution-based vector quantization. NDVQ quantizes the encoder output by selecting the most similar probability distribution in the codebook, then employs a re-parameterization trick to obtain the quantization result from the selected entry's mean and variance. The discriminator is utilized only during training.
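
The re-parameterization step mentioned in the caption can be sketched as follows. This is a generic illustration of the trick (sample = mean + standard deviation × unit Gaussian noise), assuming a diagonal Gaussian stored as mean and log-variance; `reparameterized_sample` and the example values are hypothetical, and how NDVQ uses the sample at training versus inference time is not specified here.

```python
import numpy as np

def reparameterized_sample(mean, log_var, rng):
    """Draw a sample from N(mean, exp(log_var)) as mean + std * eps.

    Writing the sample as a deterministic function of (mean, log_var)
    plus independent noise is what lets gradients reach both parameters
    in an autodiff framework.
    """
    eps = rng.standard_normal(mean.shape)  # eps ~ N(0, I)
    std = np.exp(0.5 * log_var)            # standard deviation from log-variance
    return mean + std * eps

rng = np.random.default_rng(0)
mean = np.array([1.0, -2.0])                 # mean of the selected code (toy values)
log_var = np.log(np.array([0.04, 0.25]))     # standard deviations 0.2 and 0.5

# Many draws should reproduce the code's mean and standard deviation.
samples = np.stack([reparameterized_sample(mean, log_var, rng) for _ in range(20000)])
print(samples.mean(axis=0), samples.std(axis=0))
```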