Multi-Level Speaker Representation for Target Speaker Extraction

Ke Zhang, Junjie Li, Shuai Wang, Yangjie Wei, Yi Wang, Yannan Wang, Haizhou Li

TL;DR

This paper addresses target speaker extraction by alleviating the speaker confusion caused by pre-trained embeddings, using a multi-level speaker representation that spans from raw spectral cues to neural embeddings. It introduces the TF Map spectral feature, a Contextual Embedding computed via cross-attention, and an utterance-level Speaker Embedding, all integrated with a Band-Split RNN backbone and a pre-trained ECAPA-TDNN encoder. The results show that the spectral-level TF Map feature, especially when combined with the Contextual Embedding, significantly boosts SI-SDRi and extraction accuracy on Libri2mix and generalizes better than high-level speaker embeddings. The approach offers a compact yet effective reference cue strategy that improves robustness to speaker variability and enhances practical TSE performance.

Abstract

Target speaker extraction (TSE) relies on a reference cue of the target to extract the target speech from a speech mixture. While a speaker embedding is commonly used as the reference cue, such an embedding, pre-trained on a large number of speakers, may suffer from confusion of speaker identity. In this work, we propose a multi-level speaker representation approach, from raw features to neural embeddings, to serve as the speaker reference cue. We generate a spectral-level representation from the enrollment magnitude spectrogram as a raw, low-level feature, which significantly improves the model's generalization capability. Additionally, we propose a contextual embedding feature based on cross-attention mechanisms that integrate frame-level embeddings from a pre-trained speaker encoder. By incorporating speaker features across multiple levels, we significantly enhance the performance of the TSE model. Our approach achieves a 2.74 dB improvement and a 4.94% increase in extraction accuracy on the Libri2mix test set over the baseline.
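
The TL;DR above names SI-SDRi (scale-invariant signal-to-distortion ratio improvement) as the reported metric. For readers unfamiliar with it, the following is a minimal sketch of the conventional definition, written in plain Python to stay dependency-free. The function names `si_sdr` and `si_sdr_improvement` and the `eps` stabilizer are illustrative assumptions, not taken from the paper, and the paper's exact evaluation code may differ.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target: alpha = <e, t> / ||t||^2.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled_target = [alpha * b for b in t]
    signal_energy = sum(s * s for s in scaled_target)
    # Everything in the estimate that is not the scaled target counts as distortion.
    noise_energy = sum((a - s) ** 2 for a, s in zip(e, scaled_target))
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the extracted signal over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


if __name__ == "__main__":
    # Toy signals (hypothetical): a target tone plus an interfering tone.
    target = [math.sin(0.1 * n) for n in range(1000)]
    interferer = [math.sin(0.37 * n + 1.0) for n in range(1000)]
    mixture = [t + i for t, i in zip(target, interferer)]
    # Pretend an extractor removed 90% of the interferer's amplitude.
    estimate = [t + 0.1 * i for t, i in zip(target, interferer)]
    print(round(si_sdr_improvement(estimate, mixture, target), 2), "dB")
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.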

Paper Structure

This paper contains 16 sections, 7 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: The architecture of the proposed target speaker extraction model with multi-level speaker features as the reference cue. The model consists of two main pipelines. The upper left-to-right pipeline is the speaker extraction module, while the lower left-to-right pipeline is a speaker encoder that extracts multi-level speaker representation. The lower pipeline serves as the reference cue of the upper pipeline in a target speaker extraction task. $\left| \cdot \right|$ denotes the magnitude operation. $\mathbf{F}_{\text{tf-map}}$, $\mathbf{F}_{\text{context}}$, and $\mathbf{F}_{\text{spk}}$ denote the TF Map feature, Contextual Embedding feature, and Speaker Embedding feature, respectively.
  • Figure 2: The calculation process of the TF Map feature. It consists of two non-negative components: basis vectors from the enrollment's magnitude spectrogram and a weighting matrix, computed based on either 1) Spectral Similarity or 2) Embedding Similarity between the mixture and the enrollment.
  • Figure 3: SI-SDRi with a single feature on the training, validation, and test sets.
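
The Figure 2 caption describes the TF Map as the product of two non-negative components: basis vectors taken from the enrollment magnitude spectrogram, and a weighting matrix derived from the similarity between mixture and enrollment. A minimal sketch of the Spectral Similarity variant follows, in plain Python. The caption does not say how the weights are normalized, so the cosine similarity, the softmax over enrollment frames, and the `temperature` parameter are all assumptions made for illustration, as are the function names.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def softmax(scores):
    """Numerically stable softmax; outputs are non-negative and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def tf_map(mixture_mag, enroll_mag, temperature=0.1):
    """Build a TF-Map-like feature with one output frame per mixture frame.

    mixture_mag: list of T_m frames, each a list of F magnitudes.
    enroll_mag:  list of T_e frames, each a list of F magnitudes (the basis).
    Each output frame is a non-negative weighted sum of enrollment frames.
    """
    result = []
    for mix_frame in mixture_mag:
        # Assumed weighting: spectral cosine similarity, sharpened by a
        # hypothetical temperature, then normalized over enrollment frames.
        scores = [cosine(mix_frame, enr) / temperature for enr in enroll_mag]
        weights = softmax(scores)
        n_bins = len(mix_frame)
        frame = [
            sum(w * enr[f] for w, enr in zip(weights, enroll_mag))
            for f in range(n_bins)
        ]
        result.append(frame)
    return result


if __name__ == "__main__":
    # Toy magnitudes (hypothetical): 2 enrollment frames, 2 mixture frames, 3 bins.
    enrollment = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    mixture = [[0.9, 0.1, 0.0], [0.0, 2.0, 0.1]]
    for frame in tf_map(mixture, enrollment):
        print([round(x, 4) for x in frame])
```

Since both the basis and the weights are non-negative, the resulting map is non-negative as well and has the same time-frequency shape as the mixture spectrogram. The Embedding Similarity variant named in the caption would, under the same assumptions, replace the spectral frames passed to `cosine` with frame-level speaker-encoder embeddings.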