WeSep: A Scalable and Flexible Toolkit Towards Generalizable Target Speaker Extraction
Shuai Wang, Ke Zhang, Shaoxiong Lin, Junjie Li, Xuefei Wang, Meng Ge, Jianwei Yu, Yanmin Qian, Haizhou Li
TL;DR
Target Speaker Extraction (TSE) isolates a specified speaker from a multi-talker mixture using a cue that identifies that speaker, addressing the cocktail-party problem. WeSep is an open-source, scalable toolkit for practical TSE research and applications, featuring unified IO, online data simulation, flexible backbones and speaker encoders, and deployment-ready exports. It supports online data augmentation, dynamic speaker mixing, and multiple fusion schemes, along with joint training and online enrollment sampling to improve generalization. Experiments on Libri2Mix and VoxCeleb1 show how fusion strategy, encoder choice, and training paradigm influence performance and cross-domain generalization. The toolkit also offers practical deployment options, and future plans include extending it to audio-visual TSE and blind speech separation.
Abstract
Target speaker extraction (TSE) focuses on isolating the speech of a specific target speaker from overlapped multi-talker speech, which is a typical setup in the cocktail party problem. In recent years, TSE has drawn increasing attention due to its potential for various applications such as user-customized interfaces and hearing aids, and as a crucial front-end processing technology for downstream tasks such as speech recognition and speaker recognition. However, there are currently few open-source toolkits or pre-trained models available for off-the-shelf usage. In this work, we introduce WeSep, a toolkit designed for research and practical applications in TSE. WeSep features flexible target speaker modeling, scalable data management, effective on-the-fly data simulation, structured recipes, and deployment support. The toolkit is publicly available at \url{https://github.com/wenet-e2e/WeSep}.
