Probing Self-supervised Learning Models with Target Speech Extraction
Junyi Peng, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Takanori Ashihara, Shoko Araki, Jan Cernocky
TL;DR
The paper addresses the challenge of evaluating self-supervised learning (SSL) speech models on target speech extraction (TSE), a task that requires both identifying the target speaker from an enrollment utterance and separating that speaker's speech from a mixture. It introduces a SUPERB-style TSE framework built on a frozen SSL backbone, with a speaker encoder that uses multi-head factorized attentive pooling (MHFA) and an extractor (MixNet+MaskNet) that is conditioned on the target speaker embedding to recover the target speech. Experiments on Libri2Mix and VoxCeleb1 show that TSE performance does not directly follow from performance on traditional speaker verification (SV) or separation tasks. The study also analyzes how architectural choices, such as the fusion strategy, loss functions, and encoder/decoder design, affect results. The proposed downstream model converges quickly but still lags behind the best dedicated TSE system, TD-SpeakerBeam. The findings offer guidance for pretraining and downstream design to better probe SSL representations for fine-grained, speaker-specific speech extraction, and they point to future directions, including fine-tuning SSL models for TSE and improving the temporal resolution of SSL models.
Abstract
Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker in a mixture, is yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task to evaluate the feature extraction capabilities of pre-trained SSL models. TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation. Specifically, we propose a TSE downstream model composed of two lightweight task-oriented modules based on the same frozen SSL model. One module functions as a speaker encoder to obtain target speaker information from an enrollment speech, while the other estimates the target speaker's mask to extract its speech from the mixture. Experimental results on the Libri2Mix dataset reveal the relevance of the TSE downstream task to probe SSL models, as its performance cannot be simply deduced from other related tasks such as speaker verification and separation.
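The two-module design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the frozen SSL model is replaced by a fixed random projection stub, the speaker encoder uses a single-query attentive pooling as a simplified stand-in for MHFA, the fusion is assumed to be element-wise multiplication, and all names and sizes (`frozen_ssl_features`, `SpeakerEncoder`, `MaskEstimator`, `HOP`, `FEAT_DIM`, and so on) are hypothetical. The decoder that maps masked features back to a waveform is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, FEAT_DIM, EMB_DIM, HOP = 4, 16, 8, 160  # hypothetical toy sizes


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frozen_ssl_features(wave):
    """Stub for a frozen SSL model; returns (layers, frames, dim).

    A real system would run a pre-trained model here. The projection is
    created from a fixed seed, so it never changes (i.e. it is "frozen").
    """
    n_frames = len(wave) // HOP
    frames = wave[: n_frames * HOP].reshape(n_frames, HOP)
    proj = np.random.default_rng(1).standard_normal((NUM_LAYERS, HOP, FEAT_DIM))
    return np.tanh(np.einsum("th,lhd->ltd", frames, proj))


def weighted_layer_sum(layers, logits):
    """SUPERB-style weighted sum over SSL layers -> (frames, dim)."""
    return np.tensordot(softmax(logits), layers, axes=(0, 0))


class SpeakerEncoder:
    """Pools enrollment features into one embedding (simplified, not MHFA)."""

    def __init__(self):
        self.layer_logits = np.zeros(NUM_LAYERS)
        self.query = rng.standard_normal(FEAT_DIM)
        self.out = rng.standard_normal((FEAT_DIM, EMB_DIM)) * 0.1

    def __call__(self, layers):
        feats = weighted_layer_sum(layers, self.layer_logits)  # (T, D)
        attn = softmax(feats @ self.query)                     # (T,)
        return (attn @ feats) @ self.out                       # (E,)


class MaskEstimator:
    """Estimates a target mask from mixture features and the embedding."""

    def __init__(self):
        self.layer_logits = np.zeros(NUM_LAYERS)
        self.mix = rng.standard_normal((FEAT_DIM, EMB_DIM)) * 0.1
        self.mask = rng.standard_normal((EMB_DIM, FEAT_DIM)) * 0.1

    def __call__(self, layers, spk_emb):
        feats = weighted_layer_sum(layers, self.layer_logits)  # (T, D)
        hidden = (feats @ self.mix) * spk_emb  # assumed multiplicative fusion
        return 1.0 / (1.0 + np.exp(-(hidden @ self.mask)))     # sigmoid mask


# Toy usage: random signals stand in for the mixture and the enrollment.
mixture = rng.standard_normal(3200)
enrollment = rng.standard_normal(1600)

spk_emb = SpeakerEncoder()(frozen_ssl_features(enrollment))
mix_layers = frozen_ssl_features(mixture)
mask = MaskEstimator()(mix_layers, spk_emb)
# Toy encoding of the mixture to which the mask is applied (an assumption).
masked = mask * mix_layers[-1]
```

Both modules read from the same frozen feature extractor and keep their own layer weights, which reflects the idea that speaker identification and separation may rely on different SSL layers. In a trainable version, only the module parameters would be updated while the SSL model stays fixed.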
