Target Speaker Extraction with Curriculum Learning
Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi
TL;DR
This work addresses target speaker extraction (TSE) by introducing Curriculum Learning (CL) to progressively expose a TSE model to data of increasing difficulty. It designs multiple difficulty measures (gender, speaker similarity, SDR, SNR) and a self-paced dynamic criterion to guide data selection and training schedules, and evaluates them on Libri2talker with a Conformer-based TSE architecture. The study finds that CL improves target isolation performance, with self-paced CL delivering the strongest gain (~1 dB iSDR) over baselines and speaker-similarity-based CL providing a strong, practical alternative. These results indicate that structured, gradual exposure to harder mixtures enhances robustness and generalization in TSE systems, and they offer guidance on measure selection and training strategy for real-world deployment.
Abstract
This paper presents a novel approach to target speaker extraction (TSE) using Curriculum Learning (CL) techniques, addressing the challenge of distinguishing a target speaker's voice from a mixture containing interfering speakers. For efficient training, we propose a curriculum that strategically selects training subsets of increasing complexity, for example subsets with increasing similarity between target and interfering speakers. Our CL strategies include variants based on predefined difficulty measures (e.g., gender, speaker similarity, and signal-to-distortion ratio) and variants based on the TSE's standard objective function, each designed to expose the model gradually to more challenging scenarios. Comprehensive testing on the Libri2talker dataset demonstrated that our CL strategies improved TSE performance, exceeding baseline models trained without CL by about 1 dB.
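The two families of strategies described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the cumulative staging scheme, and the keep-the-easiest-fraction rule are assumptions chosen for illustration. The sketch assumes each training mixture has a scalar difficulty score (for instance, cosine similarity between target and interferer speaker embeddings, where higher similarity is taken to mean harder) or a current per-sample training loss.

```python
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def build_curriculum(difficulties: Sequence[float], num_stages: int) -> List[List[int]]:
    """Predefined-measure curriculum (hypothetical staging scheme).

    Sorts sample indices from easiest to hardest and returns cumulative
    subsets, so each stage keeps the earlier samples and adds harder ones.
    """
    order = sorted(range(len(difficulties)), key=lambda i: difficulties[i])
    stages = []
    for stage in range(1, num_stages + 1):
        # Ceiling division so the final stage always covers every sample.
        cutoff = -(-len(order) * stage // num_stages)
        stages.append(order[:cutoff])
    return stages


def self_paced_subset(losses: Sequence[float], keep_fraction: float) -> List[int]:
    """Loss-based selection (hypothetical rule).

    Keeps the fraction of samples with the lowest current training loss;
    raising keep_fraction over training admits progressively harder samples.
    """
    n_keep = max(1, int(len(losses) * keep_fraction))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return order[:n_keep]
```

In use, one would compute a difficulty score per mixture once before training for the predefined variants, whereas the loss-based variant would recompute `losses` with the current model before each selection step.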
