FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

Jiaqi Li, Yao Qian, Yuxuan Hu, Leying Zhang, Xiaofei Wang, Heng Lu, Manthan Thakker, Jinyu Li, Sheng Zhao, Zhizheng Wu

TL;DR

FlexiCodec addresses the challenge of preserving semantic information in neural audio codecs operating at very low frame rates. It introduces a dynamic frame-rate scheme with an ASR-guided dual-stream encoder, frame-merging/unmerging transformers, and FSQ/RVQ quantization to produce controllable frame rates between 3 and 12.5 Hz while maintaining high semantic fidelity. The approach demonstrates superior semantic preservation at low rates (e.g., 6.25 Hz) and competitive acoustic quality against bitrate- and frame-rate-matched baselines, with demonstrated benefits in downstream TTS and audio understanding tasks. This work enables efficient, flexible audio representations for LM-based systems and edge deployments: shorter token sequences reduce language-model compute without sacrificing core intelligibility or perceptual quality.

Abstract

Neural audio codecs are foundational to speech language models. They are expected to have a low frame rate and decoupled semantic and acoustic information. A lower frame rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but even lower frame rate codecs remain underexplored. We find that a major challenge for very low frame rate tokens is missing semantic information. This paper introduces FlexiCodec to address this limitation. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring ASR feature-assisted dual-stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses fewer frames in information-sparse regions by adaptively merging semantically similar frames. A dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments at 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec outperforms baseline systems in semantic information preservation and delivers high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS. Demos are available at: https://flexicodec.github.io
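
To make the idea of adaptively merging semantically similar frames concrete, here is a minimal sketch, not the paper's implementation. It assumes that adjacent 12.5Hz frames are merged when the cosine similarity of their features reaches a threshold tau, that merged frames are averaged, and that a group holds at most max_group frames. The function names, the averaging rule, the default values, and the toy features are all hypothetical; the abstract only states that similar frames are merged and that the resulting rate lies between 3Hz and 12.5Hz.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_similar_frames(frames, tau=0.9, max_group=4):
    """Greedily group consecutive frames (assumed rule, not taken from the paper).

    A frame joins the current group when its cosine similarity to the
    previous frame is at least `tau` and the group is not yet full.
    Returns the averaged feature of each group and each group's length.
    """
    groups = []
    for frame in frames:
        if (groups and len(groups[-1]) < max_group
                and cosine(groups[-1][-1], frame) >= tau):
            groups[-1].append(frame)
        else:
            groups.append([frame])
    merged = [[sum(col) / len(group) for col in zip(*group)] for group in groups]
    lengths = [len(group) for group in groups]
    return merged, lengths


def average_frame_rate(lengths, base_rate_hz=12.5):
    """Average rate after merging: merged tokens per second of audio."""
    return base_rate_hz * len(lengths) / sum(lengths)


# Toy 2-dimensional "semantic" features for six consecutive 12.5Hz frames.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0],
          [0.02, 1.0], [0.01, 0.98], [1.0, 1.0]]

merged, lengths = merge_similar_frames(frames, tau=0.9)
print(lengths)                      # [2, 3, 1]
print(average_frame_rate(lengths))  # 6.25

# A threshold above 1 can never be met, so nothing merges and the rate stays 12.5Hz.
_, no_merge_lengths = merge_similar_frames(frames, tau=1.1)
print(average_frame_rate(no_merge_lengths))  # 12.5
```

In this sketch, lowering tau merges more frames and lowers the average frame rate, which is one way a single threshold could give inference-time control over the rate. With max_group set to 4, the lowest reachable rate is 12.5 / 4 = 3.125Hz; that cap is an assumption chosen to land near the stated 3Hz lower bound.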

Paper Structure

This paper contains 41 sections, 5 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Overview of FlexiCodec. The model encodes speech through two streams. The Frame Merging Modules dynamically reduce the 12.5Hz features into lower frame rates, and the Frame Unmerging Module restores a 12.5Hz fixed frame rate. The model is trained end to end.
  • Figure 2: Detailed views of the Frame Merging Module and the Frame Unmerging Module.
  • Figure 3: Evaluation results on three very low frame rates. Each baseline system has been retrained for each target frame rate using the same recipe as FlexiCodec.
  • Figure 4: Correlation between FlexiCodec frame rate and phoneme rate at a fixed frame merging threshold $\tau$. Each data point is one utterance from the TIMIT dataset, plotted as its average phoneme rate vs. its average FlexiCodec frame rate.
  • Figure 5: Visualizations of FlexiCodec dynamic-rate tokens aligned with TIMIT dataset phonemes. We visualize four randomly selected utterances in which different speakers say the same transcript, "She had your dark suit in greasy wash water all year." The inferred content of each merged token is labeled in green font.
  • ...and 1 more figures