FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec
Zhihao Du, Shiliang Zhang, Kai Hu, Siqi Zheng
TL;DR
The paper tackles the lack of open, reproducible frameworks for neural speech codecs and proposes FunCodec as a unified toolkit that bundles model implementations, training recipes, and evaluation tools. It introduces both time- and frequency-domain codecs (including FreqCodec) and leverages residual vector quantization with semantic augmentation, guided by adversarial training with multiple discriminators to enhance quality at low bitrates. Extensive experiments on LibriTTS and cross-corpus benchmarks demonstrate competitive speech quality, effective bitrate reduction through semantic and frequency-domain strategies, and useful downstream applicability for ASR and TTS. The work provides practical, download-ready models and scripts, enabling researchers to reproduce results, compare baselines, and integrate neural speech codecs into broader speech-text applications.
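Since residual vector quantization is central to the codecs summarized above, a minimal sketch of the general idea may help. It is not FunCodec's implementation: the function names, the two tiny hand-written codebooks, and the 2-dimensional vector are all illustrative assumptions. Each stage picks the codebook entry nearest to whatever the previous stages left unexplained, so the codec transmits one index per stage.

```python
# Toy residual vector quantization (RVQ) in pure Python.
# Illustrative only: names and codebook values are assumptions,
# not taken from FunCodec.

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(codebooks, vec):
    """Quantize vec stage by stage; each stage encodes the remaining residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next stage sees only the leftover error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Reconstruct the vector by summing the selected entry of every stage."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Two hypothetical stages: a coarse codebook and a finer one for the residual.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.25], [-0.25, 0.25], [0.25, -0.25]],
]

indices = rvq_encode(CODEBOOKS, [1.2, 0.8])
reconstruction = rvq_decode(CODEBOOKS, indices)
print(indices, reconstruction)
```

In this toy setup each stage has 4 entries and therefore costs 2 bits per frame. More generally, the bitrate of such a quantizer is the frame rate times the number of stages times log2 of the codebook size, which is why using fewer stages lowers the bitrate at the cost of a coarser reconstruction.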
Abstract
This paper presents FunCodec, a fundamental neural speech codec toolkit, which is an extension of the open-source speech processing toolkit FunASR. FunCodec provides reproducible training recipes and inference scripts for the latest neural speech codec models, such as SoundStream and Encodec. Thanks to the unified design with FunASR, FunCodec can be easily integrated into downstream tasks, such as speech recognition. Along with FunCodec, pre-trained models are also provided, which can be used for academic or generalized purposes. Based on the toolkit, we further propose the frequency-domain codec models, FreqCodec, which can achieve comparable speech quality with much lower computation and parameter complexity. Experimental results show that, under the same compression ratio, FunCodec can achieve better reconstruction quality compared with other toolkits and released models. We also demonstrate that the pre-trained models are suitable for downstream tasks, including automatic speech recognition and personalized text-to-speech synthesis. This toolkit is publicly available at https://github.com/alibaba-damo-academy/FunCodec.
