Target Speech Extraction with Pre-trained Self-supervised Learning Models

Junyi Peng; Marc Delcroix; Tsubasa Ochiai; Oldrich Plchot; Shoko Araki; Jan Cernocky

TL;DR

This work evaluates pre-trained self-supervised learning (SSL) models for target speech extraction (TSE) and introduces a SUPERB-style downstream task to benchmark SSL models on TSE. It presents two plug-in modules, an Adaptive Input Enhancer (AIE) and an SSL-based speaker encoder (SpkEnc), that fuse multi-scale SSL features into the TD-SpeakerBeam framework. On Libri2Mix, SSL representations substantially outperform traditional acoustic features; the best TD-SpeakerBeam extension reaches an SI-SDRi of around 14.65 dB and reduces failure rates, especially after joint fine-tuning of the SSL components. The results underscore the value of combining CNN and Transformer representations through hierarchical upsampling, offering a scalable way to use SSL models in complex speech extraction tasks and pointing to efficiency as a direction for future work.

Abstract

Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker from a mixture, guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework: to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to use SSL models effectively for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state of the art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolution of the CNN encoder and Transformer blocks through progressive upsampling, capturing both fine-grained and hierarchical features. Our method outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on LibriMix. Moreover, fine-tuning the whole model, including the SSL model parameters, improves performance by a further 0.7 dB.
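The abstract reports results as SI-SDR improvement (SI-SDRi). As a reminder of what that number measures, the following is a minimal sketch of the commonly used scale-invariant SDR definition, written in plain Python for readability. The function names `si_sdr` and `si_sdr_improvement`, the mean removal step, and the `eps` stabilizer are assumptions made for this illustration and are not taken from the paper's evaluation code.

```python
import math


def _zero_mean(signal):
    """Remove the DC offset, as is common before computing SI-SDR."""
    mean = sum(signal) / len(signal)
    return [v - mean for v in signal]


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length signals."""
    est = _zero_mean(estimate)
    ref = _zero_mean(reference)
    # Project the estimate onto the reference to find the optimal scaling.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]
    # Everything not explained by the scaled reference counts as distortion.
    distortion = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    distortion_energy = sum(d * d for d in distortion)
    return 10.0 * math.log10((target_energy + eps) / (distortion_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the extracted signal over the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


if __name__ == "__main__":
    # Toy signals (hypothetical): a target tone plus an interfering tone.
    n = 100
    target = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
    interferer = [math.sin(2 * math.pi * 13 * i / n) for i in range(n)]
    mixture = [t + v for t, v in zip(target, interferer)]
    extracted = [t + 0.1 * v for t, v in zip(target, interferer)]
    print(si_sdr_improvement(extracted, mixture, target))
```

Because the measure is an improvement over the mixture, a system that simply returns its input scores 0 dB, and rescaling the estimate does not change the score.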

Paper Structure (13 sections, 3 figures, 3 tables)

Introduction
Prior works
Conventional neural TSE
Exploiting pre-trained SSL models for TSE
SUPERB-style downstream TSE model
TD-SpeakerBeam extension with pre-trained SSL models
Adaptive input enhancer
Speaker encoder based on pre-trained model
Experiments
Experiment Setup
Evaluation results following SUPERB's setup
Evaluation Results on TD-SpeakerBeam setup
Conclusions

Figures (3)

Figure 1: Layer-wise weights of the speaker encoder (SpkEnc) and extractor, using the BLSTM-based TSE downstream model and the WavLM Base Plus pre-trained SSL model. Note that the $0$-th Transformer layer denotes the output of the CNN encoder, which is also the input of the 1st Transformer layer.
Figure 2: (a) Overview of the proposed SSL model-based TSE system, and the details of (b) AIE module, (c) upsample blocks, and (d) SSL-based SpkEnc.
Figure 3: Comparison of SI-SDRi scores of test set samples using TD-SpeakerBeam (X-axis) against the best SSL-based model (Y-axis).
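
The layer-wise weights shown in Figure 1 come from combining the outputs of all SSL layers into one feature sequence. A minimal sketch of that kind of combination, following the SUPERB convention of a softmax-normalized weighted sum, is given below. The softmax normalization, the function name `weighted_layer_sum`, and the nested-list data layout are assumptions for illustration; the paper's exact implementation may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_outputs, layer_logits):
    """Fuse per-layer features into one sequence.

    layer_outputs: list of layers, each a list of frames, each a list of floats.
                   Index 0 is assumed to hold the CNN encoder output.
    layer_logits:  one learnable scalar per layer (plain floats in this sketch).
    Returns the normalized weights and the fused frame sequence.
    """
    weights = softmax(layer_logits)
    num_frames = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    fused = [[0.0] * dim for _ in range(num_frames)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(num_frames):
            for d in range(dim):
                fused[t][d] += weight * layer[t][d]
    return weights, fused


if __name__ == "__main__":
    # Two toy layers, two frames, two feature dimensions (hypothetical values).
    layers = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[3.0, 2.0], [2.0, 3.0]],
    ]
    weights, fused = weighted_layer_sum(layers, [0.0, 0.0])
    print(weights, fused)
```

In a trained system the logits would be learned jointly with the downstream model, and the resulting weights are what a plot such as Figure 1 visualizes per layer.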
