A Streamable Neural Audio Codec with Residual Scalar-Vector Quantization for Real-Time Communication
Xiao-Hang Jiang, Yang Ai, Rui-Chen Zheng, Zhen-Hua Ling
TL;DR
This work targets real-time neural audio coding by addressing latency and codebook utilization in causal architectures. It introduces the residual scalar-vector quantizer (RSVQ), a hierarchical residual quantizer that chains $N_s$ scalar quantizers and $N_v$ improved vector quantizers, inside StreamCodec, a fully causal MDCT-domain codec; training uses codebook clustering and balancing to maximize codebook utilization. On LibriTTS at 16 kHz, StreamCodec achieves a ViSQOL of $4.30$ at $1.5$ kbps with a fixed latency of $20$ ms, generation speeds of nearly $100\times$ real time on GPU and $20\times$ on CPU, and $7.21$M parameters, with decoded quality close to that of non-streamable baselines. The approach thus combines high coding quality, efficiency, and practicality for real-time communication systems.
Abstract
This paper proposes StreamCodec, a streamable neural audio codec designed for real-time communication. StreamCodec adopts a fully causal, symmetric encoder-decoder structure and operates in the modified discrete cosine transform (MDCT) domain, aiming for low-latency inference and efficient real-time generation. To improve codebook utilization and compensate for the audio quality loss caused by structural causality, StreamCodec introduces a novel residual scalar-vector quantizer (RSVQ). The RSVQ connects scalar quantizers and improved vector quantizers sequentially in a residual manner: the scalar quantizers construct coarse audio contours, and the vector quantizers refine acoustic details. Experimental results confirm that StreamCodec achieves decoded audio quality comparable to that of advanced non-streamable neural audio codecs. Specifically, on the 16 kHz LibriTTS dataset, StreamCodec attains a ViSQOL score of 4.30 at 1.5 kbps. It has a fixed latency of only 20 ms, generates audio nearly 20 times faster than real time on a CPU, and has a lightweight model size of just 7M parameters, making it well suited to real-time communication applications.
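To make the residual chaining described in the abstract concrete, the following is a minimal NumPy sketch of a scalar-then-vector residual quantizer. It is an illustration under stated assumptions, not the authors' implementation: the uniform rounding scheme in `scalar_quantize`, the number of levels, the random codebooks, and all function names are hypothetical, and the codebook clustering and balancing used in training are not modeled.

```python
import numpy as np

def scalar_quantize(x, levels=9):
    """Assumed scalar quantizer: clip to [-1, 1], round each dimension to a uniform grid."""
    half = (levels - 1) / 2.0
    idx = np.round(np.clip(x, -1.0, 1.0) * half).astype(int)
    return idx / half, idx

def vector_quantize(x, codebook):
    """Plain nearest-neighbor vector quantizer (one code index per frame)."""
    # Squared Euclidean distance between every frame and every codeword.
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    return codebook[idx], idx

def rsvq(x, n_scalar, codebooks, levels=9):
    """Chain scalar stages, then vector stages, each quantizing the previous residual."""
    residual = x.copy()
    quantized = np.zeros_like(x)
    indices = []
    for _ in range(n_scalar):            # coarse stages
        q, idx = scalar_quantize(residual, levels)
        quantized += q
        residual -= q
        indices.append(idx)
    for codebook in codebooks:           # refinement stages
        q, idx = vector_quantize(residual, codebook)
        quantized += q
        residual -= q
        indices.append(idx)
    return quantized, indices, residual

# Toy input: 5 frames of 4-dimensional latent features (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
codebooks = [0.5 * rng.standard_normal((16, 4)) for _ in range(2)]
quantized, indices, residual = rsvq(x, n_scalar=1, codebooks=codebooks)
```

Each stage only sees what the earlier stages failed to represent, so the sum of all stage outputs plus the final residual reproduces the input exactly; the transmitted bitstream would consist of the per-stage indices.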
