Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection

Bang Zeng, Ming Li

TL;DR

The paper tackles the practical challenge of determining who spoke what and when in real-world audio by proposing USEF-TP, a speaker-embedding-free model that jointly performs Target Speaker Extraction (TSE) and Personal Voice Activity Detection (PVAD). It introduces a cross-attention–based frame-level speaker representation, a fusion backbone, a TF-GridNet–style separator, and an interaction module that feeds PVAD information back into the TSE path, all optimized with a scenario-aware differentiated loss to handle varying levels of speaker overlap. The approach achieves state-of-the-art or competitive results on LibriMix and SparseLibriMix for TSE and PVAD, shows competitive performance on real recordings from CALLHOME, and outperforms embedding-based baselines and single-task models. Overall, USEF-TP demonstrates robust joint modeling of TSE and PVAD without speaker embeddings, offering practical benefits for real-world diarization and speech processing systems across diverse overlap conditions.

Abstract

Determining 'who spoke what and when' remains challenging in real-world applications. In typical scenarios, Speaker Diarization (SD) is employed to address the problem of 'who spoke when,' while Target Speaker Extraction (TSE) or Target Speaker Automatic Speech Recognition (TSASR) techniques are utilized to resolve the issue of 'who spoke what.' Although some works have achieved promising results by combining SD and TSE systems, inconsistencies remain between SD and TSE regarding both output inconsistency and scenario mismatch. To address these limitations, we propose a Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection (USEF-TP) model that jointly performs TSE and Personal Voice Activity Detection (PVAD). USEF-TP leverages frame-level features obtained through a cross-attention mechanism as speaker-related features instead of using speaker embeddings as in traditional approaches. Additionally, a multi-task learning algorithm with a scenario-aware differentiated loss function is applied to ensure robust performance across various levels of speaker overlap. The experimental results show that our proposed USEF-TP model achieves superior performance in TSE and PVAD tasks on the LibriMix and SparseLibriMix datasets. The results on the CALLHOME dataset demonstrate the competitive performance of our model on real recordings.
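
To make the idea of frame-level speaker features from cross-attention more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it uses a single attention head with random projection weights, and it assumes that the encoded mixture supplies the queries while the encoded reference speech supplies the keys and values. The names `cross_attention`, `mix`, `ref`, `w_q`, `w_k`, and `w_v`, as well as all dimensions, are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(mix, ref, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative assumption, not the paper's code).

    mix: (T_mix, D) encoded mixture frames, used as queries.
    ref: (T_ref, D) encoded reference frames, used as keys and values.
    Returns one speaker-related feature vector per mixture frame,
    plus the attention weights.
    """
    q = mix @ w_q                              # (T_mix, D)
    k = ref @ w_k                              # (T_ref, D)
    v = ref @ w_v                              # (T_ref, D)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T_mix, T_ref)
    attn = softmax(scores, axis=-1)            # each row sums to 1
    return attn @ v, attn                      # (T_mix, D), (T_mix, T_ref)


# Toy inputs: 50 mixture frames, 30 reference frames, 16-dim features.
rng = np.random.default_rng(0)
T_mix, T_ref, D = 50, 30, 16
mix = rng.standard_normal((T_mix, D))
ref = rng.standard_normal((T_ref, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

spk_feat, attn = cross_attention(mix, ref, w_q, w_k, w_v)
print(spk_feat.shape)  # (50, 16)
```

The point of the sketch is the output shape: the speaker-related feature has the same number of frames as the mixture, whatever the length of the reference, whereas an utterance-level speaker embedding would collapse the reference into a single vector.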

Paper Structure (38 sections, 25 equations, 6 figures, 7 tables)

This paper contains 38 sections, 25 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The diagram of a typical target speaker extraction or personal voice activity detection method. The speaker embedding extractor is typically a pre-trained speaker recognition model. 'C' denotes concatenation.
  • Figure 2: The diagram of the USEF-TP model. 'CMHA' denotes the cross multi-head attention. $\otimes$ is an operation for element-wise product.
  • Figure 3: The diagram of different scene clips from a mixed audio recording. 'TA' denotes Target speaker Active. 'TS' denotes Target speaker Silence.
  • Figure 4: The diagram of the USEF-TP model. 'm' and 'r' denote the mixed speech and reference speech, respectively. We use two weight-sharing encoders to process the mixed and reference speech separately. $\otimes$ is an operation for element-wise product. The Separator’s parameters are set identically to those of the TF-GridNet approach.
  • Figure 5: The diagram of the SEB-TP model. 'm' and 'r' denote the mixed speech and reference speech, respectively. $\boldsymbol{E}_{r}$ denotes the target speaker embedding. $\otimes$ is an operation for element-wise product. The Separator’s parameters are set identically to those of the TF-GridNet approach.
  • ...and 1 more figures