Hierarchical speaker representation for target speaker extraction

Shulin He, Huaiwen Zhang, Wei Rao, Kanghao Zhang, Yukai Ju, Yang Yang, Xueliang Zhang

TL;DR

The paper tackles target speaker extraction by addressing the limitation of using a single, dense speaker embedding. It introduces Hierarchical Representation (HR), which fuses local anchor-derived features with a global speaker embedding and integrates them across multiple layers of a CRN-based separator (HR-TSE) using ARN and ECAPA-TDNN components. Empirical results on Libri-2talker and DNS Challenge data show that HR substantially improves standard metrics (e.g., SI-SNR, STOI, ESTOI, PESQ) and that combining local and global features yields the largest gains, with TEA-PSE benefiting notably from HR. The findings demonstrate that hierarchical, multi-level anchor information can significantly enhance target speaker isolation and generalize across models, achieving state-of-the-art performance on challenging benchmarks.
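
Of the metrics named above, SI-SNR is simple enough to state directly. The sketch below uses the commonly cited definition (zero-mean both signals, project the estimate onto the target, compare the energy of the projection with the energy of the residual). The function name `si_snr` and the `eps` stabilizer are illustrative choices, not taken from the paper, and the paper's exact evaluation code may differ.

```python
import math


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target: the part of the estimate
    # that is a scaled copy of the target.
    scale = sum(e * t for e, t in zip(est, tgt)) / (sum(t * t for t in tgt) + eps)
    s_target = [scale * t for t in tgt]

    # Everything else in the estimate counts as residual noise.
    e_noise = [e - s for e, s in zip(est, s_target)]

    target_energy = sum(s * s for s in s_target)
    noise_energy = sum(n * n for n in e_noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the target before the energies are compared, rescaling the estimate leaves the score essentially unchanged, which is why the metric is called scale-invariant.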

Abstract

Target speaker extraction aims to isolate a specific speaker's voice from a mixture of multiple sound sources, guided by an enrollment utterance, also called an anchor. Current methods predominantly derive a speaker embedding from the anchor and integrate it into the separation network to extract the target speaker's voice. However, this representation is overly simple, often a single 1×1024 vector, and such densely packed information is difficult for the separation network to exploit effectively. To address this limitation, we introduce a methodology called Hierarchical Representation (HR) that fuses anchor information at both granular (local) and overarching (global) levels across 5 layers of the separation network, improving the precision of target extraction. HR makes fuller use of the anchor to improve target speaker isolation. On the Libri-2talker dataset, HR substantially outperforms state-of-the-art time-frequency domain techniques. Further demonstrating HR's capabilities, we achieved first place in the ICASSP 2023 Deep Noise Suppression Challenge. The proposed HR methodology shows great promise for advancing target speaker extraction through enhanced anchor utilization.
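
To make the idea of fusing local and global anchor information at several layers concrete, here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the attention-style pooling in `local_context`, the concatenation in `fuse_layer`, the helper `fuse_all_layers`, and the toy dimensions are hypothetical and are not the paper's actual fusion mechanism, which this summary does not specify.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def local_context(mix_frame, anchor_frames):
    """Assumed local cue: attention-pool the anchor frames, using one
    mixture frame as the query (scaled dot-product weights)."""
    scale = math.sqrt(len(mix_frame))
    weights = softmax([dot(mix_frame, a) / scale for a in anchor_frames])
    dim = len(anchor_frames[0])
    return [sum(w * a[d] for w, a in zip(weights, anchor_frames)) for d in range(dim)]


def fuse_layer(mix_frames, anchor_frames, global_embedding):
    """Concatenate each mixture frame with its local anchor context and
    the utterance-level (global) speaker embedding."""
    return [
        frame + local_context(frame, anchor_frames) + global_embedding
        for frame in mix_frames
    ]


def fuse_all_layers(layer_mix, layer_anchor, global_embedding):
    """Apply the same fusion at every layer, each with its own features."""
    return [
        fuse_layer(mix, anchor, global_embedding)
        for mix, anchor in zip(layer_mix, layer_anchor)
    ]


# Toy data: 2 layers, 3 mixture frames and 4 anchor frames of dimension 2,
# plus a 3-dimensional global embedding (all values are made up).
layer_mix = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.2, 0.1], [0.4, 0.3], [0.6, 0.5]],
]
layer_anchor = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]],
    [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0], [0.2, 0.8]],
]
global_embedding = [0.9, -0.1, 0.3]

fused_layers = fuse_all_layers(layer_mix, layer_anchor, global_embedding)
```

In this sketch the global embedding is identical for every frame and layer, while the local context changes per frame and per layer. That contrast is the intuition behind describing the representation as hierarchical.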

Paper Structure

This paper contains 11 sections, 8 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: Overview of the proposed HR-TSE system.