BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec

Detai Xin; Xu Tan; Shinnosuke Takamichi; Hiroshi Saruwatari

TL;DR

BigCodec addresses the loss of perceptual speech quality at very low bitrates by scaling a neural codec up to 159M parameters and using a VQ-VAE with a single 8192-entry codebook and low-dimensional quantization. Trained on LibriSpeech, it operates at 1.04 kbps and outperforms existing low-bitrate codecs in both objective and subjective evaluations, often matching or surpassing higher-bitrate baselines and even exceeding the ground truth in subjective perceptual quality. Ablation studies show that temporal modeling and the large-scale architecture both matter, while inference remains close to real time on CPU. The work shows that substantial gains at very low bitrates are achievable through model scale, efficient quantization, and GAN-based training, with potential for broader audio applications and further bitrate reductions.
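The two numbers quoted above, an 8192-entry codebook and 1.04 kbps, are linked by simple arithmetic. The sketch below assumes that the codec emits exactly one codebook index per frame and that 1.04 kbps means 1040 bits per second; under those assumptions each index costs log2(8192) = 13 bits, which implies 80 frames per second. The function names are illustrative and do not come from the paper.

```python
import math


def bitrate_bps(codebook_size: int, frames_per_second: float, num_codebooks: int = 1) -> float:
    """Bits per second when every frame emits one index per codebook."""
    bits_per_index = math.log2(codebook_size)
    return num_codebooks * bits_per_index * frames_per_second


def implied_frame_rate(target_bps: float, codebook_size: int, num_codebooks: int = 1) -> float:
    """Frame rate needed to hit target_bps with the given codebook setup."""
    return target_bps / (num_codebooks * math.log2(codebook_size))


# Assumption: a single codebook with 8192 entries, one index per frame.
bits_per_index = math.log2(8192)           # 13.0 bits
frame_rate = implied_frame_rate(1040, 8192)  # 80.0 frames per second
print(bits_per_index, frame_rate, bitrate_bps(8192, frame_rate))
```

The same helper makes it easy to see why multi-codebook designs cost more: with the frame rate held fixed, doubling `num_codebooks` doubles the bitrate.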

Abstract

We present BigCodec, a low-bitrate neural speech codec. While recent neural speech codecs have shown impressive progress, their performance deteriorates significantly at low bitrates (around 1 kbps). Although a low bitrate inherently restricts performance, other factors, such as model capacity, also hinder further improvement. To address this problem, we scale the model up to 159M parameters, more than 10 times larger than popular codecs with about 10M parameters. In addition, we integrate sequential models into traditional convolutional architectures to better capture temporal dependencies, and we adopt low-dimensional vector quantization to ensure high code utilization. Comprehensive objective and subjective evaluations show that BigCodec, at a bitrate of 1.04 kbps, significantly outperforms several existing low-bitrate codecs. Furthermore, BigCodec achieves objective performance comparable to that of popular codecs operating at 4-6 times higher bitrates, and even delivers better subjective perceptual quality than the ground truth.
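To make "low-dimensional vector quantization" and "code utilization" concrete, here is a toy sketch in plain Python. It assumes one common formulation, in which encoder features are linearly projected to a small dimension before a nearest-neighbor codebook lookup; the abstract does not spell out BigCodec's exact mechanism, so the projection matrix, the tiny codebook, and all function names below are hypothetical and chosen only for illustration.

```python
def project(vec, weight):
    """Linear projection; `weight` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def nearest_code(z, codebook):
    """Index of the codebook entry closest to z in squared Euclidean distance."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(z, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize(frames, down_proj, codebook):
    """Project each frame to the low-dimensional space, then pick its nearest code."""
    return [nearest_code(project(f, down_proj), codebook) for f in frames]


def code_utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


# Toy setup (hypothetical values): 4-dim features, 2-dim codes, 4 codebook entries.
down_proj = [[1, 0, 0, 0], [0, 1, 0, 0]]
codebook = [[0, 0], [1, 0], [0, 1], [1, 1]]
frames = [[0.1, 0.2, 5, 5], [0.9, 0.1, -3, 2], [0.8, 0.9, 0, 0]]

indices = quantize(frames, down_proj, codebook)
print(indices, code_utilization(indices, len(codebook)))
```

In a trained codec the projection and codebook would be learned, and utilization would be measured over a large evaluation set rather than three frames; the point here is only that the lookup happens in the small projected space.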

Paper Structure (21 sections, 1 figure, 3 tables)

Introduction
Related work
BigCodec
Architecture
VQ-VAE generator
Encoder & Decoder
Vector quantization
Discriminators
Training objective
Reconstruction loss
GAN loss
VQ loss
Weighting
Scaling up the model size
Experiments
...and 6 more sections

Figures (1)

Figure 1: Architecture of the VQ-VAE generator of BigCodec. All symbols are defined in the Architecture section.
