A Real-Time Wideband Neural Vocoder at 1.6 kb/s Using LPCNet

Jean-Marc Valin, Jan Skoglund

TL;DR

This work presents LPCNet, a real-time neural vocoder that operates at 1.6 kb/s by combining linear prediction with sparse recurrent networks. The architecture pairs a frame-rate conditioning network with a sample-rate network, and the codec adds carefully designed quantization of the pitch and cepstrum, with 40 ms packetization to maintain low latency. Trained with noise injection, domain adaptation, and data augmentation, the 1.6 kb/s vocoder delivers quality surpassing MELP, while unquantized LPCNet can exceed the quality of a waveform codec operating at low bitrate, demonstrating the practicality of neural synthesis for ultra-low-bitrate speech coding. The results suggest potential for neural post-filtering and higher-bitrate operation to further close the gap to uncompressed speech while preserving real-time deployment on mobile devices.
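
As a quick sanity check on the numbers quoted above, the bit budget per packet follows directly from the bitrate and the packet duration. This is plain arithmetic on the two figures stated in the summary; how those bits are split between pitch and cepstrum is not given here.

$$
1600\ \text{b/s} \times 0.040\ \text{s} = 64\ \text{bits per packet}
$$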

Abstract

Neural speech synthesis algorithms are a promising new approach for coding speech at very low bitrate. They have so far demonstrated quality that far exceeds traditional vocoders, at the cost of very high complexity. In this work, we present a low-bitrate neural vocoder based on the LPCNet model. The use of linear prediction and sparse recurrent networks makes it possible to achieve real-time operation on general-purpose hardware. We demonstrate that LPCNet operating at 1.6 kb/s achieves significantly higher quality than MELP and that uncompressed LPCNet can exceed the quality of a waveform codec operating at low bitrate. This opens the way for new codec designs based on neural synthesis models.

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Overview of the LPCNet model. The frame rate network (yellow) operates on 10-ms frames and its output is held constant through each frame for the sample rate network (blue). The compute prediction block applies linear prediction to predict the sample at time $t$ from the previous samples. Conversions between $\mu$-law and linear are omitted for clarity. The de-emphasis filter is applied to the output $s_{t}$.
  • Figure 2: Prediction and quantization of the cepstrum for packet $k$. Vectors in green are quantized independently, vectors in blue are quantized with prediction, and vectors in red use prediction with no residual quantization. Prediction is shown by the arrows.
  • Figure 3: Noise injection during the training procedure, with $Q$ denoting $\mu$-law quantization and $Q^{-1}$ denoting conversion from $\mu$-law to linear. The prediction filter is given by $P\left(z\right)=\sum_{i=1}^{M}a_{i}z^{-i}$. The target excitation is computed as the difference between the clean, unquantized input and the noisy prediction. Note that the noise is added in the $\mu$-law domain so that its power follows that of the real excitation signal.
  • Figure 4: Subjective quality (MUSHRA) results for both listening tests. Set 1 is taken from the NTT database, while Set 2 consists of Opus testvector samples.
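
The captions of Figures 1 and 3 describe three operations that are easy to sketch in code: $\mu$-law conversion ($Q$ and $Q^{-1}$), linear prediction of the sample at time $t$ from previous samples, and a target excitation computed as the clean input minus a prediction made from noisy past samples. The Python sketch below is one possible reading of those captions, not the authors' implementation. The names `mulaw_encode`, `mulaw_decode`, `predict`, and `noisy_training_signals` are hypothetical, and the 8-bit $\mu$-law constant, the uniform integer noise, and the exact point where noise enters the loop are all assumptions.

```python
import numpy as np

MU = 255  # assumption: 8-bit mu-law; the captions do not state the value


def mulaw_encode(x):
    """Q: linear samples in [-1, 1] -> integer mu-law codes in [0, MU]."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return np.round((y + 1.0) * 0.5 * MU).astype(np.int64)


def mulaw_decode(codes):
    """Q^-1: integer mu-law codes in [0, MU] -> linear samples in [-1, 1]."""
    y = 2.0 * np.asarray(codes, dtype=np.float64) / MU - 1.0
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(MU)) / MU


def predict(history, a):
    """Linear prediction p_t = sum_{i=1..M} a_i * s_{t-i}.

    history[-1] is s_{t-1}; history must hold at least M samples.
    """
    M = len(a)
    past = np.asarray(history[-M:], dtype=np.float64)[::-1]  # s_{t-1}..s_{t-M}
    return float(np.dot(a, past))


def noisy_training_signals(clean, a, rng, noise_codes=1):
    """Return (target excitation, noisy signal) for a clean input.

    The prediction is computed from the noisy past signal, and the target
    excitation is the clean sample minus that prediction. Noise is added to
    the mu-law code of the excitation (assumption: uniform integer noise in
    [-noise_codes, noise_codes]).
    """
    M = len(a)
    noisy = np.zeros(len(clean) + M)  # M zeros serve as initial history
    excitation = np.zeros(len(clean))
    for t, s in enumerate(clean):
        p = predict(noisy[:t + M], a)            # prediction from noisy past
        excitation[t] = s - p                    # clean input minus prediction
        code = mulaw_encode(excitation[t])
        code = code + rng.integers(-noise_codes, noise_codes + 1)
        code = np.clip(code, 0, MU)              # stay inside the code range
        noisy[t + M] = np.clip(p + mulaw_decode(code), -1.0, 1.0)
    return excitation, noisy[M:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = 0.5 * np.sin(2 * np.pi * 0.01 * np.arange(200))
    a = np.array([0.9])  # toy first-order predictor, for illustration only
    exc, noisy = noisy_training_signals(clean, a, rng)
    print(exc.shape, noisy.shape)
```

Because a step of one $\mu$-law code is small near zero and larger at high amplitudes, perturbing the code rather than the linear value makes the noise scale with the excitation, which matches the caption's remark that the noise power follows that of the real excitation signal.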