CodecSlime: Temporal Redundancy Compression of Neural Speech Codec via Dynamic Frame Rate

Hankun Wang, Yiwei Guo, Chongtian Shao, Bohan Li, Kai Yu

TL;DR

CodecSlime addresses the inefficiency of fixed-frame-rate neural speech codecs by enabling dynamic frame rate through an unsupervised, plugin-style approach. It combines ScheDFR, an inference-time downsampling scheduler, with Melt-and-Cool, a two-stage training recipe that adapts backbones to dynamic frame rates, yielding robust reconstruction at lower frame rates. Empirically, a CodecSlime-integrated 80 Hz backbone run at a 40 Hz inference frame rate achieves up to a 32% relative WER reduction compared with a 40 Hz fixed-rate baseline at a similar content bitrate, while maintaining competitive perceptual metrics and generalizing to higher frame rates and unseen languages. This work demonstrates that temporal redundancy in speech can be compressed effectively without retraining for each frame-rate target, enabling flexible, high-quality speech tokenization for downstream applications.
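To make the headline numbers concrete, the short sketch below works through the arithmetic. The baseline WER of 5.0% is a hypothetical value chosen only for illustration (it is not a result from the paper), and the function names are made up for this example. The 600 bps figure is the approximate bitrate the abstract quotes for 40 Hz operation.

```python
def apply_relative_reduction(baseline_wer: float, relative_reduction: float) -> float:
    """WER after a relative reduction, e.g. 0.32 for 'up to 32%'."""
    return baseline_wer * (1.0 - relative_reduction)


def average_bits_per_frame(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Average bit budget per frame implied by a bitrate and an average frame rate."""
    return bitrate_bps / frame_rate_hz


# Hypothetical baseline: a 40 Hz fixed-rate codec with 5.0% WER (illustrative only).
best_case_wer = apply_relative_reduction(5.0, 0.32)
print(f"WER after a 32% relative reduction: {best_case_wer:.2f}%")

# About 600 bps at an average of 40 frames per second, as quoted in the abstract.
bits_per_frame = average_bits_per_frame(600, 40)
print(f"Average bits per frame: {bits_per_frame:.1f}")
```

A relative reduction scales the baseline rather than subtracting percentage points, so the absolute gain depends on how high the baseline WER is. The per-frame figure is an average over the utterance; how that budget is divided between token indices and any frame-duration side information is not stated in this summary.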

Abstract

Neural speech codecs have been widely used in audio compression and various downstream tasks. Current mainstream codecs are fixed-frame-rate (FFR): they allocate the same number of tokens to every equal-duration slice. However, speech is inherently non-uniform in temporal information density, so many tokens are wasted on steady-state segments such as long vowels and silences. To address this mismatch, we present CodecSlime, a plugin-style method that compresses temporal redundancy by supporting, for the first time, dynamic frame rate (DFR) in neural speech codecs. Our method is unsupervised and architecture-agnostic, combining two key innovations, ScheDFR and Melt-and-Cool, for adapting inference and training, respectively. When integrated into a typical VQ-GAN codec backbone and operating at 40 Hz DFR ($\approx$ 600 bps), CodecSlime reduces reconstruction WER by up to 32% relative to conventional FFR baselines with the same model architecture and similar bitrates, while remaining competitive on other metrics. CodecSlime also enables flexible trade-offs between reconstruction quality and bitrate: a single model supports inference at multiple frame rates and consistently outperforms FFR models at the corresponding frame rates. Audio samples are available at https://acadarmeria.github.io/codecslime/.
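The idea of spending fewer tokens on steady-state segments can be illustrated with a toy example. The sketch below is not ScheDFR and makes no claim about how the paper schedules downsampling. It is a simple greedy heuristic, assumed here purely for illustration, that repeatedly averages the two most similar neighbouring segments until a target number of frames remains. The function name `merge_redundant_frames` and the tiny two-dimensional "features" are hypothetical.

```python
def merge_redundant_frames(frames, target_len):
    """Greedily merge the most similar adjacent segments until target_len remain.

    Toy illustration only; NOT the ScheDFR scheduler from the paper.
    Returns (segments, durations), where durations[i] counts how many
    original frames segment i covers.
    """
    if not 1 <= target_len <= len(frames):
        raise ValueError("target_len must be between 1 and len(frames)")
    segs = [[float(x) for x in f] for f in frames]
    durs = [1] * len(segs)
    while len(segs) > target_len:
        # Squared Euclidean distance between each pair of neighbouring segments.
        dists = [
            sum((a - b) ** 2 for a, b in zip(segs[i], segs[i + 1]))
            for i in range(len(segs) - 1)
        ]
        i = dists.index(min(dists))  # leftmost most-similar pair
        total = durs[i] + durs[i + 1]
        # Duration-weighted mean, so a segment stays the mean of its original frames.
        segs[i] = [
            (a * durs[i] + b * durs[i + 1]) / total
            for a, b in zip(segs[i], segs[i + 1])
        ]
        durs[i] = total
        del segs[i + 1], durs[i + 1]
    return segs, durs


# Three identical "silence" frames, one transition frame, two identical "vowel" frames.
frames = [[0, 0], [0, 0], [0, 0], [1, 1], [5, 5], [5, 5]]
merged, durations = merge_redundant_frames(frames, target_len=3)
print(merged)
print(durations)
```

In this example six input frames collapse to three segments whose durations still sum to six: each steady run costs a single frame while the transition keeps its own. A real DFR codec would presumably also need to convey those durations to the decoder, which this sketch does not model.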

Paper Structure

This paper contains 30 sections, 3 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Comparison of (a) a conventional 40 Hz fixed-rate model and (b) a CodecSlime-integrated model, which combines Melt-and-Cool training with ScheDFR for inference and achieves the lowest WER.
  • Figure 2: Overview of the Melt-and-Cool training recipe.
  • Figure 3: WER and PESQ across frame rates for CodecSlime (a single model) and FFR baselines (a separate model for each frame rate). All models shown in the figure use FSQ; the VQ versions show similar trends.