ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models

Jee-weon Jung, Wangyou Zhang, Jiatong Shi, Zakaria Aldeneh, Takuya Higuchi, Barry-John Theobald, Ahmed Hussen Abdelaziz, Shinji Watanabe

TL;DR

ESPnet-SPK is an open-source toolkit for training speaker embedding extractors that unifies data processing, model architectures, and evaluation in a modular, reproducible framework. It supports models ranging from x-vector to SKA-TDNN, accepts self-supervised learning (SSL) front-ends such as WavLM, and publishes trained extractors to HuggingFace for off-the-shelf use. Its recipes report superior or competitive results on the VoxCeleb, VoxBlink, and SASV benchmarks, including an equal error rate of 0.39% on Vox1-O with WavLM-Large and ECAPA-TDNN, and the extractors are shown to be useful in downstream text-to-speech (TTS) and target speaker extraction (TSE) tasks. By making SSL front-ends, modular components, and cross-domain use easy to experiment with, ESPnet-SPK lowers the barrier to speaker-embedding research.

Abstract

This paper introduces ESPnet-SPK, a toolkit designed with several objectives for training speaker embedding extractors. First, we provide an open-source platform for researchers in the speaker recognition community to effortlessly build models. We provide several models, ranging from x-vector to recent SKA-TDNN. Through the modularized architecture design, variants can be developed easily. We also aspire to bridge developed models with other domains, facilitating the broad research community to effortlessly incorporate state-of-the-art embedding extractors. Pre-trained embedding extractors can be accessed in an off-the-shelf manner and we demonstrate the toolkit's versatility by showcasing its integration with two tasks. Another goal is to integrate with diverse self-supervised learning features. We release a reproducible recipe that achieves an equal error rate of 0.39% on the Vox1-O evaluation protocol using WavLM-Large with ECAPA-TDNN.
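
The equal error rate (EER) quoted above is the operating point at which the false acceptance rate and the false rejection rate of a verification system coincide. As a rough illustration only, the sketch below estimates it from two lists of trial scores with a plain threshold sweep. The function name `equal_error_rate` and the toy scores are hypothetical, and this is not the scoring code used by the toolkit.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER (as a fraction) with a simple threshold sweep.

    Hypothetical helper for illustration; a trial is accepted when its
    score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores (made up): one same-speaker trial scores below one impostor trial.
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
print(f"EER = {100 * equal_error_rate(same_speaker, different_speaker):.2f}%")
```

With only a handful of trials the estimate is coarse; on an evaluation list the size of Vox1-O the same sweep resolves much finer differences, which is how values such as 0.39% become meaningful.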

Paper Structure

This paper contains 14 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Process pipeline of ESPnet-SPK, structured in multiple stages akin to the Kaldi speech processing toolkit (Povey et al., 2011). The top section outlines stages 1 through 10, the speaker verification process, with stages 9 and 10 dedicated to optionally publishing trained speaker embedding extractors. Furthermore, it highlights the ease of using publicly available embedding extractors in an off-the-shelf manner.
  • Figure 2: Illustration of the modular sub-components of the speaker embedding extractor. Users can effortlessly construct thousands of model architectures in the configuration file by combining these sub-components.
  • Figure 3: Sample Code demonstrating the use of public and custom speaker embedding extractors of ESPnet-SPK.