ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models
Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe
TL;DR
ESPnet-SPK is an open-source toolkit for speaker embedding extraction that unifies data processing, model architectures, and evaluation in a modular, reproducible framework. It supports models ranging from x-vector to SKA-TDNN, accepts self-supervised learning (SSL) models such as WavLM as front-ends, and publishes trained extractors to HuggingFace for off-the-shelf use. Its reproducible recipes report superior or competitive performance on the VoxCeleb, VoxBlink, and SASV benchmarks, and the extractors prove useful in downstream tasks such as TTS and TSE. By making it easy to experiment with SSL front-ends, modular components, and cross-domain applications, ESPnet-SPK lowers the barrier to research on robust, transferable speaker embeddings.
Abstract
This paper introduces ESPnet-SPK, a toolkit for training speaker embedding extractors, designed with several objectives. First, we provide an open-source platform on which researchers in the speaker recognition community can easily build models. We provide several models, ranging from x-vector to the recent SKA-TDNN, and the modularized architecture design makes it easy to develop variants. Second, we aim to bridge the developed models with other domains so that the broader research community can readily incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner, and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.
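The equal error rate (EER) quoted above is the operating point at which the false acceptance rate equals the false rejection rate over a list of verification trials. The following is a minimal sketch of how such a figure can be computed from trial scores, assuming cosine similarity between two speaker embeddings as the trial score and a plain threshold sweep. The function names `cosine_score` and `equal_error_rate` are hypothetical and are not taken from the ESPnet-SPK code base, whose scoring and EER routines may differ.

```python
import numpy as np


def cosine_score(emb_a, emb_b):
    """Cosine similarity between two embeddings (assumed trial score)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def equal_error_rate(scores, labels):
    """Approximate EER by sweeping every observed score as a threshold.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_target = int(np.sum(labels == 1))
    n_nontarget = int(np.sum(labels == 0))
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = None, None
    for threshold in np.unique(scores):
        accepted = scores >= threshold
        # False acceptance: non-target trial accepted.
        far = np.sum(accepted & (labels == 0)) / n_nontarget
        # False rejection: target trial rejected.
        frr = np.sum(~accepted & (labels == 1)) / n_target
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return float(eer)
```

With a finite trial list the two error rates rarely cross exactly, so this sketch returns their average at the closest threshold. Multiplying the result by 100 gives a percentage comparable in form to the 0.39% figure.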
