Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments
Shitong Xu, Yiyuan Yang, Niki Trigoni, Andrew Markham
TL;DR
This work tackles target speaker extraction (TSE) when the enrollment samples are noisy and overlap with interfering speech. It introduces a positive/negative enrollment framework: a two-branch encoder with an attention-based fusion mechanism produces a target speaker embedding that conditions the extraction model, enabling monaural and binaural TSE without clean enrollments. A two-stage training regime accelerates convergence and improves robustness. The method achieves state-of-the-art SI-SNRi on monaural mixtures with noisy enrollments and is resilient to labeling errors and realistic perturbations. The approach still produces artifacts and trails state-of-the-art methods that rely on clean enrollments, which points to further work on encoder flexibility and artifact suppression. The results have practical implications for real-world communication aids and audio editing in environments where clean target samples are unavailable.
Abstract
Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have used clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Few prior works have explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves an SI-SNRi more than 2.1 dB higher than prior works when extracting monaural speech from a mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%. Overall, our method achieves state-of-the-art performance in monaural target speaker extraction conditioned on noisy enrollments. Our implementation is available at https://github.com/xu-shitong/TSE-through-Positive-Negative-Enroll .
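
The headline metric above, SI-SNRi, is the improvement in scale-invariant signal-to-noise ratio of the extracted signal over the unprocessed mixture. The sketch below follows the commonly used definition of SI-SNR (zero-mean signals, projection of the estimate onto the reference); it is not taken from the authors' repository, and the function names `si_snr` and `si_snr_improvement` as well as the toy sinusoid signals are assumptions made purely for illustration.

```python
import math


def _zero_mean(x):
    """Return a copy of x with its mean removed."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    est = _zero_mean(estimate)
    ref = _zero_mean(target)
    # Project the estimate onto the reference to obtain the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    proj = [scale * r for r in ref]
    # Everything left after removing the projection counts as noise.
    noise = [e - p for e, p in zip(est, proj)]
    proj_energy = sum(p * p for p in proj)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((proj_energy + eps) / (noise_energy + eps))


def si_snr_improvement(estimate, mixture, target):
    """SI-SNRi: SI-SNR of the estimate minus SI-SNR of the raw mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)


# Toy signals (illustrative only): two sinusoids standing in for two speakers.
N = 1000
target = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interferer = [math.sin(2 * math.pi * 13 * n / N) for n in range(N)]
mixture = [t + i for t, i in zip(target, interferer)]
# A hypothetical extractor output that attenuates the interferer tenfold.
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]

improvement = si_snr_improvement(estimate, mixture, target)
print(f"SI-SNRi: {improvement:.2f} dB")
```

In this toy case the mixture has equal-energy, orthogonal target and interferer components, so attenuating the interferer by a factor of ten corresponds to an improvement of about 20 dB. Because the estimate is projected onto the reference, rescaling the estimate leaves the score unchanged, which is the property that distinguishes SI-SNR from plain SNR.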
