Latent-Domain Predictive Neural Speech Coding
Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu
TL;DR
This work tackles temporal redundancy in neural speech codecs by introducing latent-domain predictive coding into the VQ-VAE framework, yielding TF-Codec, a low-latency end-to-end codec. It combines learnable compression of the time-frequency input, a Distance-Gumbel-Softmax vector quantizer with rate control, and adversarial training to achieve high perceptual quality at very low bitrates. In subjective tests, TF-Codec at 1 kbps outperforms Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps, while maintaining real-time performance on CPU and robustness to transmission errors. The approach yields interpretable latent representations and demonstrates potential extensions to broader audio signals, including music and expressive speech.
Abstract
Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs encode either acoustic features or blindly learned features with a convolutional neural network, which leaves temporal redundancies in the encoded features. This paper introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes the TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames, so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions under a rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Further studies demonstrate the effectiveness of these techniques. Code and models are available at https://github.com/microsoft/TF-Codec.
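To make the quantization idea concrete, the following is a minimal sketch of a distance-to-soft mapping combined with Gumbel-Softmax, written in plain Python. It is not taken from the TF-Codec repository. The function names (`distance_gumbel_softmax`, `entropy_bits`), the use of negative squared Euclidean distance as logits, the temperature `tau`, and the use of the entropy of the assignment probabilities as a rate proxy are all assumptions made for illustration.

```python
import math
import random


def distance_gumbel_softmax(z, codebook, tau=1.0, rng=None):
    """Softly assign a latent vector z to codebook entries (illustrative sketch).

    Returns (probs, soft_code, hard_index).
    """
    rng = rng or random.Random(0)
    # Distance-to-soft mapping (assumed form): negative squared distances
    # serve as logits, so nearer codewords receive higher probability.
    logits = [-sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in codebook]
    # Add Gumbel noise g = -log(-log(u)), u ~ Uniform(0, 1), then scale by tau.
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Soft code: probability-weighted mix of codewords (differentiable in a
    # real autodiff framework); hard index: what would be transmitted.
    soft_code = [sum(p * c[d] for p, c in zip(probs, codebook))
                 for d in range(len(z))]
    hard_index = max(range(len(probs)), key=probs.__getitem__)
    return probs, soft_code, hard_index


def entropy_bits(probs):
    """Entropy of an assignment distribution in bits (assumed rate proxy)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Hypothetical 2-D codebook with four entries and one latent frame.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs, soft_code, hard_index = distance_gumbel_softmax([0.9, 0.1], codebook)
rate_estimate = entropy_bits(probs)
```

In a trainable codec the same computation would run in an autodiff framework, so that gradients flow through `probs` to both the encoder and the codebook, and an entropy term like `rate_estimate` could be added to the loss to steer the bitrate.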
