Libri2Vox Dataset: Target Speaker Extraction with Diverse Speaker Conditions and Synthetic Data

Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi

TL;DR

The paper tackles the robustness and generalization gap in target speaker extraction (TSE) by introducing Libri2Vox, a dataset that pairs clean LibriTTS target speech with diverse VoxCeleb2 interference speech under realistic noise. It augments this data with synthetic interference speakers generated by SALT and SynVox2 and investigates curriculum learning (CL), which progressively exposes models to harder scenarios. Across multiple TSE architectures, including Conformer, BLSTM, SpeakerBeam, and VoiceFilter, training on Libri2Vox with synthetic data and CL yields consistent gains, with Conformer achieving the largest improvements (e.g., up to 16.20 dB SDR improvement (iSDR) on Libri2Vox with three-stage CL). The results indicate that combining diverse real-world data, synthetic speaker augmentation, and structured training strategies improves TSE performance and robustness in realistic environments, with potential benefits for hearing aids, conferencing, and related applications.
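To make the iSDR figure concrete, the following is a minimal sketch of how an SDR improvement can be computed. It assumes the simple energy-ratio definition of SDR (reference energy divided by error energy); the paper may use a different variant such as BSS-eval SDR. The function names `sdr_db` and `sdr_improvement_db` and the toy sine-wave signals are illustrative assumptions, not part of the paper.

```python
import math


def sdr_db(reference, estimate):
    """Energy-ratio SDR in dB (assumed definition; requires a nonzero error)."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_energy / error_energy)


def sdr_improvement_db(reference, mixture, estimate):
    """iSDR: SDR of the extracted signal minus SDR of the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals: a target tone, an interfering tone, and their sum.
target = [math.sin(0.1 * n) for n in range(1000)]
interferer = [math.sin(0.37 * n + 1.0) for n in range(1000)]
mixture = [t + i for t, i in zip(target, interferer)]

# Hypothetical extractor output that leaves 10% of the interferer amplitude.
estimate = [t + 0.1 * i for t, i in zip(target, interferer)]

improvement = sdr_improvement_db(target, mixture, estimate)
```

In this toy case the interferer amplitude is reduced by a factor of 10, so `improvement` comes out at about 20 dB.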

Abstract

Target speaker extraction (TSE) is essential in speech processing applications, particularly in scenarios with complex acoustic environments. Current TSE systems face challenges in limited data diversity and a lack of robustness in real-world conditions, primarily because they are trained on artificially mixed datasets with limited speaker variability and unrealistic noise profiles. To address these challenges, we propose Libri2Vox, a new dataset that combines clean target speech from the LibriTTS dataset with interference speech from the noisy VoxCeleb2 dataset, providing a large and diverse set of speakers under realistic noisy conditions. We also augment Libri2Vox with synthetic speakers generated using state-of-the-art speech generative models to enhance speaker diversity. Additionally, to further improve the effectiveness of incorporating synthetic data, curriculum learning is implemented to progressively train TSE models with increasing levels of difficulty. Extensive experiments across multiple TSE architectures reveal varying degrees of improvement, with SpeakerBeam demonstrating the most substantial gains: a 1.39 dB improvement in signal-to-distortion ratio (SDR) on the Libri2Talker test set compared to baseline training. Building upon these results, we further enhanced performance through our speaker similarity-based curriculum learning approach with the Conformer architecture, achieving an additional 0.78 dB improvement over conventional random sampling methods in which data samples are randomly selected from the entire dataset. These results demonstrate the complementary benefits of diverse real-world data, synthetic speaker augmentation, and structured training strategies in building robust TSE systems.
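The speaker similarity-based curriculum can be pictured with a small sketch. The abstract only states that training proceeds with increasing difficulty. The code below therefore makes three assumptions: difficulty is measured by the cosine similarity between target and interferer speaker embeddings, more similar speakers are harder to separate, and the ranked pairs are split into equally sized stages. The function names and the 2-D toy embeddings are hypothetical, and the paper's actual stage definitions may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def curriculum_stages(pairs, num_stages=3):
    """Split (target_emb, interferer_emb) pairs into stages, easiest first.

    Assumption: a low target/interferer similarity means an easy pair.
    """
    ranked = sorted(pairs, key=lambda p: cosine_similarity(p[0], p[1]))
    n = len(ranked)
    return [
        ranked[n * i // num_stages : n * (i + 1) // num_stages]
        for i in range(num_stages)
    ]


# Toy 2-D "embeddings": one target speaker against six interferers.
target_emb = (1.0, 0.0)
interferer_embs = [
    (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0),
    (1.0, 1.0), (-1.0, 1.0), (1.0, 0.1),
]
pairs = [(target_emb, emb) for emb in interferer_embs]

stages = curriculum_stages(pairs)  # train on stages[0] first, stages[-1] last
```

A training loop would iterate over `stages` in order, so the model sees dissimilar speaker pairs before near-identical ones. Random sampling, the baseline mentioned in the abstract, corresponds to ignoring this ordering.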

Paper Structure

This paper contains 41 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Basic conceptual framework of TSE.
  • Figure 2: Data generation framework of Libri2Vox and its synthetic version.
  • Figure 3: Details of Conformer, BLSTM, and SpeakerBeam TSE models.
  • Figure 4: Three-stage curriculum learning.
  • Figure 5: Impact of synthetic speaker ratio within one batch at Stage 3 of the configuration "w/ 3-stage CL (Real + SALT)" on Conformer. The red dashed line corresponds to the performance (7.17 dB) where the training starts from scratch using synthetic data only.
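The "synthetic speaker ratio within one batch" studied in Figure 5 can be sketched as a batch sampler that draws a fixed fraction of items from a synthetic pool and the rest from a real pool. This is an illustrative sketch only: the function name `sample_mixed_batch`, the pool contents, and the rounding of the synthetic count are assumptions, not the paper's implementation.

```python
import random


def sample_mixed_batch(real_items, synthetic_items, batch_size,
                       synthetic_ratio, rng=None):
    """Draw one batch with a fixed fraction of synthetic-speaker items.

    Assumes both pools hold enough items to sample without replacement.
    """
    if rng is None:
        rng = random.Random()
    num_synthetic = round(batch_size * synthetic_ratio)
    batch = rng.sample(synthetic_items, num_synthetic)
    batch += rng.sample(real_items, batch_size - num_synthetic)
    rng.shuffle(batch)  # avoid a fixed real/synthetic ordering inside the batch
    return batch


# Hypothetical pools of training items tagged by origin.
real_pool = [("real", i) for i in range(100)]
synthetic_pool = [("synthetic", i) for i in range(100)]

batch = sample_mixed_batch(real_pool, synthetic_pool, batch_size=16,
                           synthetic_ratio=0.25, rng=random.Random(0))
num_synthetic_in_batch = sum(1 for origin, _ in batch if origin == "synthetic")
```

With `batch_size=16` and `synthetic_ratio=0.25`, each batch holds 4 synthetic and 12 real items. Sweeping `synthetic_ratio` from 0 to 1 reproduces the kind of axis the figure varies.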