Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee
TL;DR
The paper addresses the slow training and inference that high-frame-rate neural audio codecs impose on Speech LLMs. It introduces the Low Frame-rate Speech Codec (LFSC), a neural codec operating at 1.89 kbps and 21.5 frames per second (FPS) that combines Finite Scalar Quantization with adversarial training against a frozen WavLM-based speech language model (SLM) discriminator to preserve quality. Through ablations and a zero-shot TTS study, LFSC shows competitive perceptual quality, improved intelligibility, and roughly a threefold speedup in downstream Speech LLM training and inference. The codec is publicly released in NeMo, and the work points to extensions to 44 kHz audio and to other audio domains such as music.
Abstract
Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
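To make the two headline numbers concrete, the following minimal Python sketch derives the bit budget per frame from the stated bitrate and frame rate, and counts how many frames a fixed-length utterance produces. The 75 FPS baseline is a hypothetical comparison value chosen for illustration, not a figure from the paper, and the sketch assumes one autoregressive decoding step per frame. The resulting step ratio is not the measured speedup reported above; it only illustrates why a lower frame rate shortens the token sequence an LLM must generate.

```python
import math

# Values stated in the abstract.
LFSC_BITRATE_BPS = 1890.0   # 1.89 kbps
LFSC_FPS = 21.5             # frames per second

# Assumption for illustration only: a hypothetical higher-frame-rate codec.
HYPOTHETICAL_BASELINE_FPS = 75.0


def bits_per_frame(bitrate_bps: float, frames_per_second: float) -> float:
    """Bits available to encode each frame: bitrate divided by frame rate."""
    return bitrate_bps / frames_per_second


def frames_for_duration(seconds: float, frames_per_second: float) -> int:
    """Number of codec frames needed to cover an utterance of the given length."""
    return math.ceil(seconds * frames_per_second)


utterance_seconds = 10.0

lfsc_bits = bits_per_frame(LFSC_BITRATE_BPS, LFSC_FPS)
lfsc_steps = frames_for_duration(utterance_seconds, LFSC_FPS)
baseline_steps = frames_for_duration(utterance_seconds, HYPOTHETICAL_BASELINE_FPS)

# Ratio of decoding steps under the one-step-per-frame assumption.
step_ratio = baseline_steps / lfsc_steps

print(f"LFSC bits per frame: {lfsc_bits:.1f}")
print(f"LFSC frames for {utterance_seconds:.0f} s: {lfsc_steps}")
print(f"Hypothetical baseline frames: {baseline_steps}")
print(f"Step ratio (baseline / LFSC): {step_ratio:.2f}")
```

At 21.5 FPS each frame carries about 88 bits, so the codec has to pack more information into every frame than a higher-frame-rate codec at the same bitrate would.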
