FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec

Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng

TL;DR

The paper tackles the lack of open, reproducible frameworks for neural speech codecs and proposes FunCodec as a unified toolkit that bundles model implementations, training recipes, and evaluation tools. It introduces both time- and frequency-domain codecs (including FreqCodec) and leverages residual vector quantization with semantic augmentation, guided by adversarial training with multiple discriminators to enhance quality at low bitrates. Extensive experiments on LibriTTS and cross-corpus benchmarks demonstrate competitive speech quality, effective bitrate reduction through semantic and frequency-domain strategies, and useful downstream applicability for ASR and TTS. The work provides practical, download-ready models and scripts, enabling researchers to reproduce results, compare baselines, and integrate neural speech codecs into broader speech-text applications.

Abstract

This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec.

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Overview of FunCodec design.
  • Figure 2: The overall architecture of the FunCodec models.
  • Figure 3: Comparison of open-source generalized models under (a) lower and (b) higher token rates. LS denotes the LibriSpeech test sets. LibriSpeech and GigaSpeech are English corpora, while AISHELL and Wenet are Mandarin corpora.