On the effectiveness of enrollment speech augmentation for Target Speaker Extraction

Junjie Li; Ke Zhang; Shuai Wang; Haizhou Li; Man-Wai Mak; Kong Aik Lee

TL;DR

This work investigates the effectiveness of augmenting the enrollment speech space for target speaker extraction and proposes a novel augmentation method, self-estimated speech augmentation (SSA), which achieves an improvement of up to 2.5 dB on the Libri2Mix test set.

Abstract

Deep learning technologies have significantly advanced the performance of target speaker extraction (TSE) tasks. To enhance the generalization and robustness of these algorithms when training data is insufficient, data augmentation is a commonly adopted technique. Unlike typical data augmentation applied to speech mixtures, this work thoroughly investigates the effectiveness of augmenting the enrollment speech space. We found that for both pretrained and jointly optimized speaker encoders, directly augmenting the enrollment speech leads to consistent performance improvement. In addition to conventional methods such as noise and reverberation addition, we propose a novel augmentation method called self-estimated speech augmentation (SSA). Experimental results on the Libri2Mix test set show that our proposed method can achieve an improvement of up to 2.5 dB.

Paper Structure (20 sections, 20 equations, 2 figures, 5 tables)

Introduction
Methods
Formulation of TSE task
Pipeline of TSE task
Loss function
Augmentations on enrollment speech
Noise
Reverberation
SpecAugment
Self-estimated speech augmentation
Experimental details
Datasets
Experimental settings
Evaluation metrics
Results and Analysis
...and 5 more sections

Figures (2)

Figure 1: The pipeline of TSE. The speaker encoder can either be pretrained on a speaker recognition task and frozen, or trained jointly from scratch with the other modules. We use ResNet34 as the speaker encoder, so the enrollment speech must first be transformed into Fbank features. Noise and reverberation augmentation are applied directly to the enrollment speech $\mathbf{c}$, and SpecAugment is applied to the Fbank feature $\mathbf{C}_f$.
Figure 2: In both the single-optimization method and the multi-optimization method, the TSE models share the same parameters.
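
The Figure 1 caption says where each conventional augmentation acts: noise and reverberation on the enrollment waveform, SpecAugment on the Fbank feature. The following is a minimal sketch of that placement, not the authors' implementation. The function names (`mix_at_snr`, `reverberate`, `mask_fbank`), the mask-width defaults, and the use of plain Python lists in place of an array library are all assumptions made for illustration. SSA is not sketched because this page does not describe how it works.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to a waveform so the mixture has the requested SNR in dB.

    Assumes both inputs are non-empty lists and the noise is not all zeros.
    """
    # Repeat the noise if it is shorter than the speech, then crop to length.
    reps = -(-len(speech) // len(noise))  # ceiling division
    noise = (noise * reps)[:len(speech)]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale the noise so speech_power / scaled_noise_power == 10^(snr_db / 10).
    scale = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


def reverberate(speech, rir):
    """Convolve a waveform with a room impulse response, keeping its length."""
    out = [0.0] * len(speech)
    for i, h in enumerate(rir):
        for j in range(len(speech) - i):
            out[i + j] += h * speech[j]
    return out


def mask_fbank(fbank, rng, max_time=10, max_freq=8):
    """SpecAugment-style masking: zero one time band and one frequency band.

    fbank is a list of frames, each a list of filterbank values.
    max_time and max_freq are hypothetical defaults, not values from the paper.
    """
    out = [frame[:] for frame in fbank]  # copy so the input stays untouched
    n_frames, n_bins = len(out), len(out[0])
    t = rng.randint(0, min(max_time, n_frames))  # time-mask width
    t0 = rng.randint(0, n_frames - t)            # time-mask start
    f = rng.randint(0, min(max_freq, n_bins))    # frequency-mask width
    f0 = rng.randint(0, n_bins - f)              # frequency-mask start
    for i in range(n_frames):
        for k in range(n_bins):
            if t0 <= i < t0 + t or f0 <= k < f0 + f:
                out[i][k] = 0.0
    return out
```

In this sketch the first two functions would be applied to the enrollment waveform before Fbank extraction, and `mask_fbank` to the extracted feature just before the speaker encoder, mirroring the ordering in the caption. A practical pipeline would more likely use an array library and batched operations.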
