
WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction

Shuai Wang, Ke Zhang, Shaoxiong Lin, Junjie Li, Xuefei Wang, Meng Ge, Jianwei Yu, Yanmin Qian, Haizhou Li

TL;DR

Target Speaker Extraction (TSE) isolates a specified speaker from a mixture using a cue, addressing the cocktail-party challenge with cue-driven reconstruction. WeSep provides an open-source, scalable toolkit featuring unified IO, online data simulation, flexible backbones and speaker encoders, and deployment-ready exports, enabling practical TSE research and applications. It introduces online data augmentation, dynamic speaker mixing, and multiple fusion schemes, along with joint training and online enrollment sampling to improve generalization. Experiments on Libri2Mix and VoxCeleb1 show how fusion strategy, encoder choice, and training paradigm influence performance and cross-domain generalization, with practical deployment options and future plans to extend to audio-visual TSE and blind speech separation.

Abstract

Target speaker extraction (TSE) focuses on isolating the speech of a specific target speaker from overlapped multi-talker speech, which is a typical setup in the cocktail party problem. In recent years, TSE has drawn increasing attention due to its potential for various applications such as user-customized interfaces and hearing aids, or as a crucial front-end processing technology for subsequent tasks such as speech recognition and speaker recognition. However, there are currently few open-source toolkits or available pre-trained models for off-the-shelf usage. In this work, we introduce WeSep, a toolkit designed for research and practical applications in TSE. WeSep features flexible target speaker modeling, scalable data management, effective on-the-fly data simulation, structured recipes, and deployment support. The toolkit is publicly available at https://github.com/wenet-e2e/WeSep.

Paper Structure (26 sections, 2 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Architecture of a typical TSE system. The cue encoder can be jointly trained or pretrained; in the joint-training mode, an additional speaker classification loss is usually added. The parameters of the cue encoder can be shared (or partially shared) with the mixture encoder.
  • Figure 2: The online data preparation pipeline in WeSep, illustrated for the case of 2 speakers.
  • Figure 3: Dynamic Speaker Mixing Strategy
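
The figures themselves are not reproduced here, so the following is only a minimal sketch of what on-the-fly two-speaker simulation with enrollment sampling could look like. It is not WeSep code: the function names (`rms`, `mix_at_snr`, `simulate_example`), the dictionary layout, and the SNR range of -5 to 5 dB are all hypothetical choices made for illustration, and signals are plain Python lists assumed to be non-silent.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(target, interferer, snr_db):
    """Scale the interferer so the target-to-interferer ratio is snr_db, then add."""
    n = min(len(target), len(interferer))  # truncate both to the shorter length
    target, interferer = target[:n], interferer[:n]
    # Gain that places the interferer snr_db below (or above) the target level.
    gain = rms(target) / (rms(interferer) * 10 ** (snr_db / 20))
    scaled = [gain * s for s in interferer]
    mixture = [t + s for t, s in zip(target, scaled)]
    return mixture, target, scaled


def simulate_example(utterances_by_speaker, rng, snr_range=(-5.0, 5.0)):
    """Draw one training example: mixture, clean target, and enrollment cue.

    Hypothetical layout: utterances_by_speaker maps a speaker id to a list of
    utterances, each with at least two utterances for the target speaker.
    """
    target_spk, interferer_spk = rng.sample(sorted(utterances_by_speaker), 2)
    # The enrollment is a different utterance from the one placed in the mixture.
    target_utt, enroll_utt = rng.sample(utterances_by_speaker[target_spk], 2)
    interferer_utt = rng.choice(utterances_by_speaker[interferer_spk])
    snr_db = rng.uniform(*snr_range)  # assumed range, not taken from the paper
    mixture, target, _ = mix_at_snr(target_utt, interferer_utt, snr_db)
    return {
        "mixture": mixture,
        "target": target,
        "enrollment": enroll_utt,
        "target_speaker": target_spk,
        "snr_db": snr_db,
    }
```

Because speakers, utterances, enrollment, and SNR are redrawn on every call, each pass over the same source data yields different mixtures, which is the general appeal of online simulation over a fixed pre-mixed set.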