An Intra-BRNN and GB-RVQ Based END-TO-END Neural Audio Codec

Linping Xu, Jiawei Jiang, Dejun Zhang, Xianjun Xia, Li Chen, Yijian Xiao, Piao Ding, Shenyi Song, Sixing Yin, Ferdous Sohel

TL;DR

The paper targets high-quality speech coding at very low bitrates by introducing CBRC, an end-to-end codec that exploits intra-frame correlations with an interleaved 1D-CNN and Intra-BRNN and reduces quantization noise with Group-wise and Beam-search RVQ. It presents four cascaded CBRNBlocks for the encoder/decoder, a GB-RVQ quantizer architecture, and a composite loss function combining reconstruction, adversarial, perceptual, and VQ terms. Empirical results on LibriTTS show that CBRC at 3 kbps outperforms Opus at 12 kbps and Lyra-V2 at 3.2 kbps, with ablation studies confirming the benefits of Intra-BRNN and the proposed RVQ variants. The work demonstrates state-of-the-art perceptual quality at very low bitrates and highlights practical potential for real-time communication with limited bandwidth.

Abstract

Recently, neural networks have proven effective for speech coding at low bitrates. However, under-utilization of intra-frame correlations and quantizer error degrade the reconstructed audio quality. To improve the coding quality, we present an end-to-end neural speech codec, namely CBRC (Convolutional and Bidirectional Recurrent neural Codec). An interleaved structure using 1D-CNN and Intra-BRNN is designed to exploit the intra-frame correlations more efficiently. Furthermore, a Group-wise and Beam-search Residual Vector Quantizer (GB-RVQ) is used to reduce the quantization noise. CBRC encodes audio every 20 ms with no additional latency, which makes it suitable for real-time communication. Experimental results demonstrate the superiority of the proposed codec when comparing CBRC at 3 kbps with Opus at 12 kbps.
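To make the bitrates quoted above concrete, the short sketch below converts a bitrate into a per-frame bit budget. At 3 kbps with one frame every 20 ms, CBRC has 60 bits per frame. The helper name `bits_per_frame` is hypothetical, and applying a 20 ms frame to Opus and Lyra-V2 is an assumption made only to put the three codecs on the same scale; the source gives no codebook layout, so none is assumed here.

```python
def bits_per_frame(bitrate_kbps, frame_ms):
    """Bits available per frame: (kbit/s) * (ms) = bits."""
    return bitrate_kbps * frame_ms


# Bitrates are the ones quoted in the summary. The 20 ms frame is stated
# for CBRC only; using it for the other codecs is an assumption.
FRAME_MS = 20
for name, kbps in [("CBRC", 3), ("Lyra-V2", 3.2), ("Opus", 12)]:
    print(f"{name}: {bits_per_frame(kbps, FRAME_MS):g} bits per {FRAME_MS} ms frame")
```

Under this assumption, the Opus operating point has four times the per-frame budget of CBRC, which is the gap the reported comparison is about.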

Paper Structure

This paper contains 12 sections, 4 equations, 5 figures, 3 tables, and 1 algorithm.

Figures (5)

  • Figure 1: CBRC model architecture
  • Figure 2: CBRNBlock diagram
  • Figure 3: Group-wise RVQ. The encoded embedding is split into $G = 2$ independent embeddings, which the quantizer processes in parallel.
  • Figure 4: Beam-search RVQ. Quantization across three VQ layers with $k = 2$ candidates. Every VQ layer except the last retains 2 quantization paths.
  • Figure 5: Subjective scores for different codecs. Error bars denote 95% confidence intervals.
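
The captions of Figures 3 and 4 describe the two ideas behind GB-RVQ: splitting the embedding into groups that are quantized independently, and keeping $k$ candidate paths per VQ layer instead of a single greedy choice. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. The function names, the squared-error ranking criterion, and the randomly generated codebooks are all assumptions made for illustration; in the paper the codebooks are learned.

```python
import random


def beam_search_rvq(x, codebooks, k=2):
    """Residual VQ that keeps the k lowest-error paths after each layer.

    With k=1 this reduces to ordinary greedy RVQ. Ranking paths by squared
    residual error is an assumption of this sketch.
    """
    paths = [(list(x), [])]  # each path: (current residual, chosen indices)
    last = len(codebooks) - 1
    for layer, codebook in enumerate(codebooks):
        candidates = []
        for residual, indices in paths:
            # Rank codewords by squared distance to this path's residual.
            ranked = sorted(
                range(len(codebook)),
                key=lambda j: sum((r - c) ** 2 for r, c in zip(residual, codebook[j])),
            )
            for j in ranked[:k]:
                new_residual = [r - c for r, c in zip(residual, codebook[j])]
                err = sum(r * r for r in new_residual)
                candidates.append((err, new_residual, indices + [j]))
        candidates.sort(key=lambda c: c[0])
        keep = 1 if layer == last else k  # the last layer keeps only the best path
        paths = [(res, idx) for _, res, idx in candidates[:keep]]
    residual, indices = paths[0]
    x_hat = [xi - ri for xi, ri in zip(x, residual)]  # sum of chosen codewords
    return indices, x_hat


def group_rvq(x, group_codebooks, k=2):
    """Split x into equal groups and quantize each with its own codebook stack."""
    size = len(x) // len(group_codebooks)
    indices, x_hat = [], []
    for g, codebooks in enumerate(group_codebooks):
        idx, part_hat = beam_search_rvq(x[g * size:(g + 1) * size], codebooks, k)
        indices.append(idx)
        x_hat.extend(part_hat)
    return indices, x_hat


# Toy usage: an 8-dim embedding, G=2 groups, 3 VQ layers of 16 random codewords.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(8)]
group_codebooks = [
    [[[rng.gauss(0, 1) for _ in range(4)] for _ in range(16)] for _ in range(3)]
    for _ in range(2)
]
indices, x_hat = group_rvq(x, group_codebooks, k=2)
print(indices)  # one list of 3 codeword indices per group
```

The groups are quantized sequentially here for simplicity; since they share no state, they could be processed in parallel as the Figure 3 caption describes. Each group in the sketch works on a lower-dimensional vector, and the beam lets a slightly worse early choice survive if it leads to a lower total error in later layers.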