Improving curriculum learning for target speaker extraction with synthetic speakers

Yun Liu; Xuechen Liu; Junichi Yamagishi

TL;DR

A k-nearest neighbor-based voice conversion method simulates and generates speech from diverse interference speakers, and the generated data is then used as part of the training data to improve curriculum learning (CL) for target speaker extraction.

Abstract

Target speaker extraction (TSE) aims to isolate the voice of an individual speaker from complex speech environments. The effectiveness of TSE systems is often compromised when the target and interference speakers have similar voice characteristics. Recent research has introduced curriculum learning (CL), in which TSE models are trained incrementally on speech samples of increasing complexity. In CL training, the model is first trained on samples with low speaker similarity between the target and interference speakers, and then on samples with high speaker similarity. To further improve CL, this paper uses a $k$-nearest neighbor-based voice conversion method to simulate and generate speech of diverse interference speakers, and then uses the generated data as part of the CL training data. Experiments demonstrate that training data based on synthetic speakers can effectively enhance the model's capabilities and significantly improve the performance of multiple TSE systems.
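The similarity-based ordering described in the abstract can be illustrated with a minimal sketch. It assumes each training mixture comes with a speaker embedding for the target and one for the interference speaker, and that cosine similarity between the two serves as the difficulty score. The function names, the use of cosine similarity, and the equal-sized split into stages are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def build_curriculum(pairs, num_stages=3):
    """Group mixtures into stages of increasing speaker similarity.

    pairs: list of (target_embedding, interferer_embedding) tuples
           (hypothetical input format, assumed for this sketch).
    Returns a list of stages; each stage is a list of indices into
    `pairs`, ordered from least similar (easy) to most similar (hard).
    """
    sims = [cosine_similarity(t, i) for t, i in pairs]
    order = np.argsort(sims, kind="stable")  # low similarity first
    return [chunk.tolist() for chunk in np.array_split(order, num_stages)]


if __name__ == "__main__":
    toy_pairs = [
        ([1.0, 0.0], [1.0, 0.0]),  # identical direction: hardest
        ([1.0, 0.0], [0.0, 1.0]),  # orthogonal: easiest
        ([1.0, 0.0], [1.0, 1.0]),  # in between
    ]
    print(build_curriculum(toy_pairs, num_stages=3))
```

In this sketch a training loop would iterate over the returned stages in order, so that mixtures with dissimilar speakers are seen before mixtures with similar ones.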

Paper Structure

This paper contains 20 sections, 1 equation, 3 figures, and 4 tables.

Introduction
Related work
Designing effective training data for TSE
CL for TSE
Improved CL using synthetic speakers
Main concept
Synthetic speaker generation based on VC
Outline
Generation of synthetic interference speakers
Experiments
Dataset
Feature extraction
Hyperparameters
Model setting
Synthetic data generation setting
...and 5 more sections

Figures (3)

Figure 1: Three-stage curriculum learning.
Figure 2: Synthetic speaker generation using the $k$-NN VC / SALT system.
Figure 3: Comparison of the ratio of synthetic speakers within a mini-batch.
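
As a rough illustration of the $k$-NN matching idea behind the voice conversion step named in Figure 2, the sketch below replaces each source frame feature with the mean of its $k$ most similar frames from a reference speaker's feature pool. It assumes frame-level features are already extracted as NumPy arrays and that cosine similarity is the matching criterion. The function name, the default `k=4`, and the array shapes are assumptions made for this example; the SALT component and the vocoder are not covered.

```python
import numpy as np


def knn_convert(source_feats, reference_feats, k=4):
    """Map each source frame to the mean of its k nearest reference frames.

    source_feats:    array of shape (T_src, D), frames of the utterance
                     to convert (hypothetical input for this sketch).
    reference_feats: array of shape (T_ref, D), frames pooled from the
                     speaker whose voice should be imitated.
    Assumes no frame is an all-zero vector (norms must be non-zero).
    Returns an array of shape (T_src, D).
    """
    src = np.asarray(source_feats, dtype=float)
    ref = np.asarray(reference_feats, dtype=float)

    # Normalise rows so that a dot product equals cosine similarity.
    src_n = src / np.linalg.norm(src, axis=1, keepdims=True)
    ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    sim = src_n @ ref_n.T  # shape (T_src, T_ref)

    # Indices of the k most similar reference frames for each source frame.
    nearest = np.argsort(-sim, axis=1, kind="stable")[:, :k]

    # Average the selected (unnormalised) reference frames.
    return ref[nearest].mean(axis=1)


if __name__ == "__main__":
    reference = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
    source = np.array([[3.0, 0.0]])
    print(knn_convert(source, reference, k=2))
```

Because every output frame is built only from reference frames, the converted features carry the reference speaker's characteristics while following the source utterance frame by frame.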
