Continuous Target Speech Extraction: Enhancing Personalized Diarization and Extraction on Complex Recordings

He Zhao, Hangting Chen, Jianwei Yu, Yuehai Wang

TL;DR

This work tackles continuous target speaker extraction in complex, long-form recordings with variable overlap and potential target absence by proposing C-TSE, a two-network framework that combines pBSRNN for extraction with A-TSVAD for precise target activity detection. It systematically evaluates three integration schemes—cascade1, cascade2, and parallel—finding that a cascaded fusion (A-TSVAD followed by pBSRNN) achieves the best balance between diarization accuracy and speech extraction quality. The paper introduces A-TSVAD with transformer-based activity detection and demonstrates that cascaded fusion improves DER/JER, SI-SNR, and PESQ over strong baselines like VBx, Pyannote, and TSVAD, especially under high overlap and target-absence conditions. These results suggest a practical pathway for robust, personalized diarization and extraction in real-world, multi-speaker environments, with future work addressing enrollment and recording-condition variability.

Abstract

Target speaker extraction (TSE) aims to extract the target speaker's voice from the input mixture. Previous studies have concentrated on highly overlapped scenarios. However, real-world applications often involve more complex conditions, such as variable speaker overlap and target speaker absence. In this paper, we introduce a framework to perform continuous TSE (C-TSE), comprising a target speaker voice activation detection (TSVAD) model and a TSE model. This framework significantly improves TSE performance on similar speakers and enhances personalization, which is lacking in traditional diarization methods. In detail, unlike conventional TSVAD, which is deployed to refine diarization results, the proposed Attention-target speaker voice activation detection (A-TSVAD) directly generates timestamps of the target speaker. We also explore different ways of integrating A-TSVAD and TSE by comparing cascaded and parallel methods. The framework's effectiveness is assessed using a range of metrics, including diarization and enhancement metrics. Our experiments demonstrate that A-TSVAD outperforms conventional methods in reducing diarization errors. Furthermore, integrating A-TSVAD and TSE in a sequential cascaded manner further enhances extraction accuracy.
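
As an illustration of one of the enhancement metrics named in the TL;DR, the sketch below computes SI-SNR using its commonly used definition: project the zero-mean estimate onto the zero-mean reference, then compare the energy of the projection with the energy of the residual. The function name `si_snr`, the `eps` stabilizer, and the toy signals are assumptions made for this example; the paper's own evaluation code is not shown here and may differ in details.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    Hypothetical helper for illustration; `eps` only guards against
    division by zero and log of zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("estimate and reference must have the same length")
    n = len(reference)

    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference: s_target = (<e, r> / ||r||^2) * r
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r)
    scale = dot / (ref_energy + eps)
    s_target = [scale * b for b in r]

    # Everything in the estimate that is not explained by the reference.
    e_noise = [a - t for a, t in zip(e, s_target)]

    target_energy = sum(t * t for t in s_target)
    noise_energy = sum(x * x for x in e_noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Toy signals (assumed, not from the paper): a sine "target" plus a small distortion.
reference = [math.sin(0.1 * i) for i in range(200)]
estimate = [s + 0.1 * math.cos(0.37 * i) for i, s in enumerate(reference)]

score = si_snr(estimate, reference)
rescaled_score = si_snr([3.0 * x for x in estimate], reference)
```

Rescaling the estimate leaves the score essentially unchanged, which is the scale invariance the name refers to. Note that the metric is not meaningful when the reference is silent, so segments where the target speaker is absent need a different treatment, for example an energy-based measure of the residual output.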

Paper Structure

This paper contains 20 sections, 14 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Graphical representation of the input in the continuous target speaker extraction task. Speaker activity falls into four scenarios: Scenario A, both the target speaker and an interfering speaker are active; Scenario B, only the target speaker is active, with no interference from any other speaker; Scenario C, the input mixture contains multiple speakers, but none of them is the target speaker; Scenario D, there is no human voice in the input speech.
  • Figure 2: The diagram of the pBSRNN system. (A) The band split module. (B) The band and sequence modeling module. (C) The mask estimation module. (D) The speaker encoder module.
  • Figure 3: A-TSVAD network
  • Figure 4: Schematic diagram of module integration.
  • Figure 5: SI-SNR performance of models at different overlap rates
  • ...and 1 more figures
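
The four scenarios in the Figure 1 caption can be read as a simple rule over two pieces of per-frame information: whether the target speaker is active, and how many other speakers are active. The sketch below is only an illustration of that reading. The function names, the frame-level inputs, and the toy data are assumptions, and treating a single non-target speaker as Scenario C is a simplification, since the caption describes that case in terms of multiple speakers.

```python
from collections import Counter


def classify_frame(target_active: bool, num_interferers: int) -> str:
    """Map one frame to Scenario A, B, C, or D from the Figure 1 caption."""
    if target_active and num_interferers > 0:
        return "A"  # target and at least one interferer are speaking
    if target_active:
        return "B"  # target speaks alone
    if num_interferers > 0:
        return "C"  # speech is present, but none of it is the target
    return "D"      # no human voice at all


def scenario_histogram(target_activity, interferer_counts):
    """Count how many frames fall into each scenario.

    Hypothetical helper: both inputs are assumed to be frame-aligned lists.
    """
    if len(target_activity) != len(interferer_counts):
        raise ValueError("inputs must be frame-aligned and equally long")
    return Counter(
        classify_frame(bool(t), n)
        for t, n in zip(target_activity, interferer_counts)
    )


# Toy frame labels (assumed): 1 = target active, integers = active interferers.
target_activity = [1, 1, 0, 0, 1, 0]
interferer_counts = [1, 0, 2, 0, 0, 1]
histogram = scenario_histogram(target_activity, interferer_counts)
```

A histogram like this makes it easy to see how much of a recording is overlapped speech (Scenario A) versus target-absent audio (Scenarios C and D), the conditions the paper highlights as challenging.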