Target Speaker Extraction with Curriculum Learning

Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi

TL;DR

This work addresses target speaker extraction (TSE) by introducing Curriculum Learning (CL) to progressively expose a TSE model to increasing data difficulty. It designs multiple difficulty measures (gender, speaker similarity, SDR, SNR) and a self-paced dynamic criterion to guide data selection and training schedules, evaluated on Libri2talker with a conformer-based TSE architecture. The study finds that CL substantially improves target isolation performance, with self-paced CL delivering the strongest gain (~1 dB iSDR) over baselines, and speaker-similarity-based CL providing a strong, practical alternative. These results demonstrate that structured, gradual exposure to harder mixtures enhances robustness and generalization in TSE systems, with clear guidance on measure selection and training strategy for real-world deployment.

Abstract

This paper presents a novel approach to target speaker extraction (TSE) using Curriculum Learning (CL) techniques, addressing the challenge of distinguishing a target speaker's voice from a mixture containing interfering speakers. For efficient training, we propose designing a curriculum that selects training subsets of increasing complexity, such as increasing similarity between target and interfering speakers, and that selects training data strategically. Our CL strategies include variants using predefined difficulty measures (e.g., gender, speaker similarity, and signal-to-distortion ratio) and variants using the TSE model's standard objective function, each designed to expose the model gradually to more challenging scenarios. Comprehensive testing on the Libri2talker dataset demonstrated that our CL strategies improved TSE performance, exceeding baseline models trained without CL by about 1 dB.
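
As a rough illustration of the curriculum idea in the abstract, the sketch below splits a set of mixtures into an easier first-phase subset and a full second-phase set using a speaker-similarity threshold. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the dictionary fields, the toy two-dimensional embeddings, the use of cosine similarity, the two-phase schedule, and the threshold value 0.8 are all hypothetical choices made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_curriculum(samples, tau_spk):
    """Split samples into two training phases (hypothetical schedule).

    Phase 1 keeps only "easy" mixtures, assumed here to be those whose
    target and interfering speakers are dissimilar (similarity < tau_spk).
    Phase 2 uses every mixture, including the hard ones.
    """
    scored = [
        (cosine_similarity(s["target_emb"], s["interferer_emb"]), s)
        for s in samples
    ]
    phase1 = [s for sim, s in scored if sim < tau_spk]
    phase2 = [s for _, s in scored]
    return phase1, phase2


# Toy data: the embeddings are made up, not real speaker embeddings.
samples = [
    {"id": "mix_a", "target_emb": [1.0, 0.0], "interferer_emb": [0.0, 1.0]},
    {"id": "mix_b", "target_emb": [1.0, 0.0], "interferer_emb": [1.0, 1.0]},
    {"id": "mix_c", "target_emb": [1.0, 0.0], "interferer_emb": [1.0, 0.1]},
]

# 0.8 is an arbitrary illustrative threshold, not a value from the paper.
phase1, phase2 = build_curriculum(samples, tau_spk=0.8)
usage = len(phase1) / len(phase2)
print([s["id"] for s in phase1], f"{usage:.0%} of data used in phase 1")
```

With this toy data, the most similar pair (`mix_c`) is held back until the second phase. Lowering the threshold makes the first phase easier but leaves it with less data.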

Paper Structure (18 sections, 1 equation, 3 figures, 4 tables)

This paper contains 18 sections, 1 equation, 3 figures, 4 tables.

Figures (3)

  • Figure 1: The conformer-based TSE architecture. Network details and loss function can be found in Supplementary materials.
  • Figure 2: Cumulative distribution function (CDF) of prediction results from the TSE model based on speaker similarity. Here, a larger CDF value indicates a higher selected proportion at the corresponding iSDR (dB) in the corresponding system. "Threshold" corresponds to $\tau_{spk}$ in Section 4.2. "Random Select" is the baseline.
  • Figure 3: iSDR changes on the dev set with $\tau_{SDR}$ in two training phases, along with data usage percentage in the 1st phase.