Latent-Domain Predictive Neural Speech Coding

Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu

TL;DR

This work tackles temporal redundancy in neural speech codecs by introducing latent-domain predictive coding into the VQ-VAE framework, yielding TF-Codec, a low-latency end-to-end codec. It combines a learnable compression of the time-frequency input, a Distance-Gumbel-Softmax vector quantizer with rate control, and adversarial training to achieve high perceptual quality at very low bitrates. In subjective tests, TF-Codec at 1 kbps outperforms Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps, while running in real time on a CPU and remaining robust to transmission errors. The learned latent representations are interpretable, and the approach could potentially extend to broader audio signals, including music and expressive speech.

Abstract

Neural audio/speech coding has recently demonstrated its capability to deliver high quality at much lower bitrates than traditional methods. However, existing neural audio/speech codecs encode either acoustic features or blindly learned features with a convolutional neural network, which leaves temporal redundancies within the encoded features. This paper introduces latent-domain predictive coding into the VQ-VAE framework to fully remove such redundancies and proposes the TF-Codec for low-latency neural speech coding in an end-to-end manner. Specifically, the extracted features are encoded conditioned on a prediction from past quantized latent frames so that temporal correlations are further removed. Moreover, we introduce a learnable compression on the time-frequency input to adaptively adjust the attention paid to main frequencies and details at different bitrates. A differentiable vector quantization scheme based on distance-to-soft mapping and Gumbel-Softmax is proposed to better model the latent distributions under a rate constraint. Subjective results on multilingual speech datasets show that, with low latency, the proposed TF-Codec at 1 kbps achieves significantly better quality than Opus at 9 kbps, and TF-Codec at 3 kbps outperforms both EVS at 9.6 kbps and Opus at 12 kbps. Extensive studies are conducted to demonstrate the effectiveness of these techniques. Code and models are available at https://github.com/microsoft/TF-Codec.
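
To make the idea of coding against a prediction from past quantized frames concrete, the following is a minimal sketch of classic closed-loop predictive coding, not the paper's method. It makes several simplifying assumptions: the predictor just repeats the previous reconstructed frame (the paper uses a learned predictor network), the residual is formed by plain subtraction (the paper conditions the encoding on the prediction), and a uniform scalar quantizer stands in for the vector quantizer. The names `quantize` and `predictive_code` and the `step` parameter are hypothetical.

```python
def quantize(vec, step):
    # Uniform scalar quantizer: a stand-in (assumption) for the learned VQ.
    return [step * round(c / step) for c in vec]


def predictive_code(frames, step=0.25):
    """Closed-loop predictive coding over a sequence of latent frames.

    Returns the quantized residuals (what would be transmitted) and the
    reconstructed frames (what the decoder would recover).
    """
    dim = len(frames[0])
    prev_recon = [0.0] * dim          # decoder state before the first frame
    residuals, recon = [], []
    for x in frames:
        pred = prev_recon             # hypothetical predictor: repeat last frame
        r = [a - b for a, b in zip(x, pred)]
        rq = quantize(r, step)        # only the quantized residual is sent
        x_hat = [p + q for p, q in zip(pred, rq)]
        residuals.append(rq)
        recon.append(x_hat)
        prev_recon = x_hat            # predict from *quantized* history only
    return residuals, recon
```

Because the prediction uses only reconstructed frames, the encoder and decoder stay in sync, and the reconstruction error per component is bounded by half the quantization step. For slowly varying frames, most residuals after the first are near zero, which is the redundancy that predictive coding exploits.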

Paper Structure

This paper contains 26 sections, 15 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Proposed latent-domain predictive neural speech coding.
  • Figure 2: Encoding and decoding modules for proposed method.
  • Figure 3: Architecture of the encoder and the decoder. D-Conv denotes depthwise convolution.
  • Figure 4: The network structure of the predictor.
  • Figure 5: Vector quantization mechanism. (a) Gumbel-Softmax as used in prior work. The latent $\bm{x}_t^N$ is projected to logits $z_i$ through a linear projection and turned into probabilities with Gumbel-Softmax. (b) Our Distance-Gumbel-Softmax. The distance between the latent $\bm{x}_t^N$ and each codeword $\bm{e}_i$ is first calculated and then mapped to probabilities with Gumbel-Softmax.
  • ...and 6 more figures
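
The Distance-Gumbel-Softmax mechanism described in the Figure 5(b) caption can be illustrated with a small sketch in plain Python. This is not the authors' implementation. It assumes that the logits are the negative squared Euclidean distances to the codewords (the paper's exact distance-to-soft mapping may differ), that the temperature `tau` is a free parameter, and that a hard nearest-codeword choice is used at inference. All function names are hypothetical.

```python
import math
import random


def squared_distances(x, codebook):
    # Squared Euclidean distance from latent x to every codeword e_i.
    return [sum((xi - ei) ** 2 for xi, ei in zip(x, e)) for e in codebook]


def gumbel_softmax(logits, tau=1.0, rng=None):
    # Add Gumbel(0, 1) noise to each logit, then apply a tempered softmax.
    rng = rng or random.Random()
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep the logs finite
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / tau)
    m = max(noisy)                                      # for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def distance_gumbel_softmax(x, codebook, tau=1.0, rng=None):
    # Assumption: closer codewords get larger logits via negated distance.
    logits = [-d for d in squared_distances(x, codebook)]
    probs = gumbel_softmax(logits, tau, rng)
    # Soft (differentiable in a real framework) mix of codewords for training.
    soft = [sum(p * e[k] for p, e in zip(probs, codebook)) for k in range(len(x))]
    # Hard selection without noise: index of the nearest codeword.
    hard_index = max(range(len(codebook)), key=lambda i: logits[i])
    return probs, soft, hard_index
```

In contrast to variant (a), where a linear projection produces the logits, the probabilities here are tied directly to the geometry of the codebook. In an actual training setup the same computation would be written in an autodiff framework so that gradients reach both the encoder and the codewords.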