CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Hankun Wang; Yiwei Guo; Chongtian Shao; Bohan Li; Kai Yu

TL;DR

CodecSlime addresses the inefficiency of fixed-frame-rate neural speech codecs by adding dynamic frame rate support through an unsupervised, plugin-style approach. It combines ScheDFR, an inference-time downsampling scheduler, with Melt-and-Cool, a two-stage training recipe that adapts a codec backbone to dynamic frame rates, yielding robust reconstruction at lower frame rates. Empirically, an 80 Hz backbone integrated with CodecSlime and run at a 40 Hz inference frame rate achieves up to a 32% relative WER reduction compared with a 40 Hz fixed-rate baseline at a similar content bitrate, while maintaining competitive perceptual metrics and generalizing to higher frame rates and unseen languages. The work demonstrates that temporal redundancy in speech can be compressed effectively without retraining for each frame-rate target, enabling flexible, high-quality speech tokenization for downstream applications.

Abstract

Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR): they allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density, so many tokens are wasted on steady-state segments such as long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method that compresses temporal redundancy by supporting dynamic frame rate (DFR) in neural speech codecs for the first time. Our method is unsupervised and architecture-agnostic, and it combines two key innovations, ScheDFR and Melt-and-Cool, which adapt inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operated at 40 Hz DFR ($\approx$ 600 bps), CodecSlime reduces reconstruction WER by up to 32% relative to conventional FFR baselines with the same model architecture and similar bitrates, while remaining competitive on other metrics. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/.
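To make the idea of temporal redundancy concrete, the following toy sketch merges adjacent, similar frames until a target frame count is reached, for example halving an 80 Hz sequence to an average of 40 Hz. It is an illustration only and is not ScheDFR: the abstract does not specify the scheduler's algorithm, and the greedy strategy, the squared-distance similarity, and the names `sq_dist` and `merge_frames` are all assumptions made for this example.

```python
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def merge_frames(frames: List[Vector], target_count: int) -> List[Tuple[Vector, int]]:
    """Greedily merge adjacent segments until `target_count` remain.

    Hypothetical illustration, not the paper's scheduler. Each returned
    segment is (mean feature vector, duration in source frames).
    """
    if not 1 <= target_count <= len(frames):
        raise ValueError("target_count must be between 1 and len(frames)")

    # Start with one segment per input frame.
    segments: List[Tuple[Vector, int]] = [(list(f), 1) for f in frames]

    while len(segments) > target_count:
        # Pick the most similar adjacent pair (first one on ties).
        i = min(
            range(len(segments) - 1),
            key=lambda k: sq_dist(segments[k][0], segments[k + 1][0]),
        )
        (a, na), (b, nb) = segments[i], segments[i + 1]
        total = na + nb
        # Duration-weighted mean keeps the segment feature equal to the
        # average of all source frames it covers.
        mean = [(x * na + y * nb) / total for x, y in zip(a, b)]
        segments[i:i + 2] = [(mean, total)]

    return segments


# Eight toy 1-D "frames": a long steady region followed by faster changes.
frames = [[0.0], [0.0], [0.0], [0.0], [1.0], [5.0], [5.0], [9.0]]
segments = merge_frames(frames, target_count=4)
durations = [d for _, d in segments]
print(durations)
```

In this toy input the steady region collapses into one long segment while the rapidly changing frames keep their own tokens, which is the kind of non-uniform allocation that a fixed-frame-rate codec cannot express. Note that a real DFR codec must also convey the segment durations to the decoder.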
