BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec

Detai Xin, Xu Tan, Shinnosuke Takamichi, Hiroshi Saruwatari

TL;DR

BigCodec tackles the challenge of maintaining perceptual speech quality at ultra-low bitrates by scaling up a neural codec to 159M parameters and employing a single 8192-code VQ-VAE with low-dimensional quantization. Trained on LibriSpeech, it operates at 1.04 kbps while delivering strong objective and subjective results, often matching or surpassing higher-bitrate baselines and, in subjective listening tests, even the ground truth. Ablation studies show the importance of temporal modeling and large-scale architecture, while the model remains close to real-time on CPU. This work demonstrates that substantial gains at very low bitrates are achievable through model scale, efficient quantization, and GAN-based training, with potential for broader audio applications and further bitrate reductions.
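The two numbers quoted above (a single 8192-entry codebook and 1.04 kbps) can be related by simple arithmetic. The sketch below assumes fixed-length coding of one codebook index per frame. The token rate is not stated in this summary; it is inferred here from the two stated figures, and the function names are illustrative.

```python
import math


def bits_per_token(codebook_size: int) -> float:
    """Bits needed to index one codebook entry with fixed-length coding."""
    return math.log2(codebook_size)


def bitrate_bps(codebook_size: int, tokens_per_second: float,
                num_codebooks: int = 1) -> float:
    """Bitrate in bits per second for a codec emitting discrete tokens."""
    return num_codebooks * bits_per_token(codebook_size) * tokens_per_second


CODEBOOK_SIZE = 8192   # from the text: a single 8192-code codebook
TARGET_BPS = 1040.0    # from the text: 1.04 kbps

# Inferred, not stated in the text: tokens per second implied by the figures.
implied_token_rate = TARGET_BPS / bits_per_token(CODEBOOK_SIZE)

print(bits_per_token(CODEBOOK_SIZE))                   # 13.0
print(implied_token_rate)                              # 80.0
print(bitrate_bps(CODEBOOK_SIZE, implied_token_rate))  # 1040.0
```

Under these assumptions each token costs 13 bits, so 1.04 kbps corresponds to roughly 80 tokens per second of speech.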

Abstract

We present BigCodec, a low-bitrate neural speech codec. While recent neural speech codecs have shown impressive progress, their performance deteriorates significantly at low bitrates (around 1 kbps). Although a low bitrate inherently restricts performance, other factors, such as model capacity, also hinder further improvements. To address this problem, we scale up the model size to 159M parameters, more than 10 times larger than popular codecs with about 10M parameters. In addition, we integrate sequential models into traditional convolutional architectures to better capture temporal dependencies, and we adopt low-dimensional vector quantization to ensure high code utilization. Comprehensive objective and subjective evaluations show that BigCodec, with a bitrate of 1.04 kbps, significantly outperforms several existing low-bitrate codecs. Furthermore, BigCodec achieves objective performance comparable to popular codecs operating at 4-6 times higher bitrates, and even delivers better subjective perceptual quality than the ground truth.
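The abstract does not spell out the quantizer, so the following is only a minimal sketch of the general idea behind vector quantization in a low-dimensional space: each latent frame is replaced by the index of its nearest codebook entry, and code utilization is the fraction of entries that are actually selected. The latent dimension, the toy codebook size, the random data, and all function names are assumptions made for illustration; a real codec would learn the codebook and a projection into the low-dimensional space.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def code_utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


rng = random.Random(0)   # fixed seed so the sketch is repeatable
CODE_DIM = 8             # hypothetical low latent dimension
CODEBOOK_SIZE = 64       # toy size; the text describes 8192 codes
NUM_FRAMES = 500         # number of latent frames to quantize

# Random stand-ins for a learned codebook and for encoder outputs.
codebook = [[rng.gauss(0.0, 1.0) for _ in range(CODE_DIM)]
            for _ in range(CODEBOOK_SIZE)]
latents = [[rng.gauss(0.0, 1.0) for _ in range(CODE_DIM)]
           for _ in range(NUM_FRAMES)]

indices = [nearest_code(z, codebook) for z in latents]
utilization = code_utilization(indices, CODEBOOK_SIZE)
print(f"used {len(set(indices))} of {CODEBOOK_SIZE} codes "
      f"(utilization {utilization:.2f})")
```

The sequence `indices` is what would be transmitted; the decoder would look up the corresponding codebook vectors and reconstruct the waveform from them.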

Paper Structure

This paper contains 21 sections, 1 figure, and 3 tables.

Figures (1)

  • Figure 1: Architecture of the VQ-VAE generator of BigCodec. All symbols are defined in the architecture subsection.