FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Luca Della Libera, Francesco Paissan, Cem Subakan, Mirco Ravanelli

TL;DR

FocalCodec targets low-bitrate speech coding: it is built on focal modulation and uses a single binary codebook to compress speech into tokens at $0.16$–$0.65$ kbps. The approach combines a robust WavLM-based encoder with a compressor–quantizer–decompressor pipeline that uses binary spherical quantization to retain both semantic and acoustic information after quantization. Training is decoupled into two stages, which enables strong reconstruction while preserving downstream utility for discriminative and generative tasks, including speech resynthesis and voice conversion on multilingual and noisy data. The results show that FocalCodec matches or outperforms state-of-the-art low-bitrate codecs while remaining simple and efficient and making good use of its codebook, with practical implications for streaming and on-device speech processing. Demo samples and code are provided to facilitate public evaluation and reuse.

Abstract

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples and code are available at https://lucadellalib.github.io/focalcodec-web/.
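The stated bitrate range follows from simple arithmetic: with a single codebook, bitrate equals token rate times bits per token. The sketch below is illustrative only. The codebook size of 8192 entries (13 bits per token) and the token rates of 50, 25, and 12.5 Hz are assumptions chosen because they reproduce the 0.16 to 0.65 kbps range; only the name "FocalCodec@50" appears in this summary, and `bitrate_kbps` is a hypothetical helper, not part of the released code.

```python
import math


def bitrate_kbps(token_rate_hz: float, codebook_size: int) -> float:
    """Bitrate of a single-codebook codec: tokens per second times bits per token."""
    bits_per_token = math.log2(codebook_size)
    return token_rate_hz * bits_per_token / 1000.0


# Assumed settings (not stated in this summary): 8192-entry codebook,
# token rates of 50, 25 and 12.5 Hz.
ASSUMED_CODEBOOK_SIZE = 8192
ASSUMED_TOKEN_RATES_HZ = (50.0, 25.0, 12.5)

for rate in ASSUMED_TOKEN_RATES_HZ:
    kbps = bitrate_kbps(rate, ASSUMED_CODEBOOK_SIZE)
    print(f"{rate:>5} Hz x 13 bits -> {kbps:.4f} kbps")
```

Under these assumptions the endpoints come out at 0.65 kbps (50 Hz) and 0.1625 kbps (12.5 Hz), which rounds to the 0.16 kbps quoted above.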

Paper Structure

This paper contains 45 sections, 4 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: FocalCodec architecture. The encoder extracts features containing both acoustic and semantic information. These features are then mapped to a low-dimensional space by the compressor, binary quantized, and projected back by the decompressor. The decoder resynthesizes the waveform from these features.
  • Figure 2: Subjective evaluation from 33 participants averaged over 10 samples. Left. Trade-off between mean opinion score and bitrate. The green dashed line highlights the reference score. FocalCodec achieves extremely low bitrates while maintaining strong performance. Right. Distribution of mean opinion score. The red lines highlight the mean. FocalCodec@50 outperforms most baselines and remains comparable to BigCodec and Stable Codec.
  • Figure 3: Reconstructed Mel-spectrograms from LibriSpeech (left; Panayotov et al., 2015) and Libri1Mix (right; Cosentino et al., 2020).
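The "binary quantized" step in the Figure 1 caption can be illustrated with a minimal sketch of binary spherical quantization: a low-dimensional latent is projected onto the unit sphere, and each coordinate is snapped to plus or minus 1/sqrt(L), so every frame maps to one of 2^L codes and needs no learned codebook lookup. This is a plain-Python illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the 4-dimensional example is arbitrary, and the compressor and decompressor (focal modulation networks in the paper) are not modeled.

```python
import math
from typing import List


def bsq_quantize(z: List[float]) -> List[float]:
    """Project z onto the unit sphere, then snap each coordinate to +/- 1/sqrt(L)."""
    dim = len(z)
    norm = math.sqrt(sum(v * v for v in z)) or 1.0  # guard against the zero vector
    unit = [v / norm for v in z]
    scale = 1.0 / math.sqrt(dim)
    # Non-negative coordinates map to +scale, negative ones to -scale.
    return [scale if v >= 0 else -scale for v in unit]


def code_to_index(q: List[float]) -> int:
    """Pack the sign pattern into an integer token (first coordinate = highest bit)."""
    index = 0
    for v in q:
        index = (index << 1) | (1 if v > 0 else 0)
    return index


def index_to_code(index: int, dim: int) -> List[float]:
    """Recover the quantized vector from its integer token."""
    scale = 1.0 / math.sqrt(dim)
    return [scale if (index >> (dim - 1 - i)) & 1 else -scale for i in range(dim)]


# Hypothetical 4-dimensional latent for one frame.
latent = [0.3, -1.2, 0.0, 2.0]
code = bsq_quantize(latent)
token = code_to_index(code)
print(code, token)
```

Because the code is fully determined by its sign pattern, the token index and the quantized vector are interchangeable, which is what lets a single binary codebook of size 2^L stand in for a learned multi-codebook design.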