Target Speaker Extraction through Comparing Noisy Positive and Negative Audio Enrollments

Shitong Xu, Yiyuan Yang, Niki Trigoni, Andrew Markham

TL;DR

This work tackles target speaker extraction when enrollment samples are noisy and overlapped with interfering speech. It introduces a positive/negative enrollment framework and a two-branch encoder with an attention-based fusion mechanism to produce a target speaker embedding used to condition extraction, enabling monaural and binaural TSE without clean enrollments. A two-stage training regime accelerates convergence and improves robustness, achieving state-of-the-art SI-SNRi on monaural mixtures with noisy enrollments and demonstrating resilience to labeling errors and realistic perturbations. While promising, the approach also reveals artifacts and a gap to clean-enrollment SOTA, suggesting avenues for further improvement in encoder flexibility and artifact suppression. The results have practical implications for real-world communication aids and audio editing in environments where clean target samples are unavailable.

Abstract

Target speaker extraction focuses on isolating a specific speaker's voice from an audio mixture containing multiple speakers. To provide information about the target speaker's identity, prior works have utilized clean audio samples as conditioning inputs. However, such clean audio examples are not always readily available. For instance, obtaining a clean recording of a stranger's voice at a cocktail party without leaving the noisy environment is generally infeasible. Limited prior research has explored extracting the target speaker's characteristics from noisy enrollments, which may contain overlapping speech from interfering speakers. In this work, we explore a novel enrollment strategy that encodes target speaker information from the noisy enrollment by comparing segments where the target speaker is talking (Positive Enrollments) with segments where the target speaker is silent (Negative Enrollments). Experiments show the effectiveness of our model architecture, which achieves over 2.1 dB higher SI-SNRi than prior works when extracting monaural speech from a mixture of two speakers. Additionally, the proposed two-stage training strategy accelerates convergence, reducing the number of optimization steps required to reach 3 dB SNR by 60%. Overall, our method achieves state-of-the-art performance in monaural target speaker extraction conditioned on noisy enrollments. Our implementation is available at https://github.com/xu-shitong/TSE-through-Positive-Negative-Enroll .
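The headline metric above, SI-SNRi, is the improvement in scale-invariant signal-to-noise ratio of the extracted signal over the unprocessed mixture. The sketch below uses the commonly used definition of SI-SNR (zero-mean signals, projection of the estimate onto the reference); it is not taken from the authors' repository, and the function names, the `eps` stabilizer, and the synthetic sine-wave signals are assumptions made for illustration.

```python
import math


def _zero_mean(x):
    """Remove the mean so the metric ignores any DC offset."""
    m = sum(x) / len(x)
    return [v - m for v in x]


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB (common definition; eps is an assumed stabilizer)."""
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to get the target component.
    scale = _dot(est, ref) / (_dot(ref, ref) + eps)
    target = [scale * r for r in ref]
    # Everything that is not explained by the reference counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    return 10.0 * math.log10((_dot(target, target) + eps) / (_dot(noise, noise) + eps))


def si_snr_improvement(estimate, mixture, reference):
    """SI-SNRi: gain of the estimate over the raw mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


# Synthetic illustration (hypothetical signals, not real speech).
n = 1600
target_speech = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
interferer = [math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
mixture = [s + i for s, i in zip(target_speech, interferer)]
# Pretend an extractor attenuated the interferer to 10% of its amplitude.
estimate = [s + 0.1 * i for s, i in zip(target_speech, interferer)]

improvement = si_snr_improvement(estimate, mixture, target_speech)
print(f"SI-SNRi: {improvement:.2f} dB")
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is why the metric is preferred over plain SNR when the extractor's output gain is arbitrary.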

Paper Structure

This paper contains 38 sections, 3 equations, 10 figures, 15 tables.

Figures (10)

  • Figure 1: a) Task scenario explored in the work. Users identify a speaker of interest in an audio mixture and label when the target speaker speaks (Positive Enrollment) or remains silent (Negative Enrollment) in the audio mixture. b) Decomposition of each speaker's voice in the audio mixture. Due to the stochasticity in human conversation, the interfering speaker will either remain silent in some of the segments in the Positive Enrollment or speak in the Negative Enrollment, leaving the target speaker as the only speaker who talks throughout the Positive Enrollments but not in the Negative Enrollment. c) The model performs self-attention between the encoded Positive and Negative Enrollments to extract the target speaker's characteristics, which serve as the conditional information for the following extraction model. The model then extracts the target speaker from the Audio Mixture.
  • Figure 2: Encoding and Extraction Branch model architecture and training pipeline.
  • Figure 3: Encoder Fusion Module pseudo-code.
  • Figure 4: Validation loss as a function of optimization step. The curve for the two-stage training begins at the 200k step to account for the 200k optimization steps performed during the first training stage.
  • Figure 5: SI-SDRi under inaccurate user labeling.
  • ...and 5 more figures