SEF-PNet: Speaker Encoder-Free Personalized Speech Enhancement with Local and Global Contexts Aggregation

Ziling Huang, Haixin Guan, Haoran Wei, Yanhua Long

TL;DR

This work tackles personalized speech enhancement (PSE) without relying on heavy speaker encoders by introducing SEF-PNet, a speaker-encoder-free network built on the sDPCCN backbone. It combines Interactive Speaker Adaptation (ISA), which iteratively fuses enrollment and noisy signals for robust target guidance, with Local and Global Context Aggregation (LCA), which enriches encoder representations with multi-scale context. The method achieves substantial gains over the sDPCCN baseline in SI-SDR, PESQ, and STOI on Libri2Mix, while reducing model size and complexity. The results suggest potential for practical, efficient deployment in real-time PSE systems and provide a pathway to extend ISA and LCA to other architectures and larger-scale tasks.
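
Of the three metrics named above, SI-SDR (scale-invariant signal-to-distortion ratio) has a short closed-form definition, so a minimal sketch may help make the reported gains concrete. The function name `si_sdr`, the `eps` stabilizer, and the mean-removal step are assumptions of this sketch, not details taken from the paper; the authors' evaluation code may differ.

```python
import math

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for two equal-length sequences of floats.

    Illustrative sketch only: `eps` and the mean removal are assumptions,
    not settings reported for SEF-PNet.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")

    # Remove the mean so a constant offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference: this is the "target" part.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref)
    alpha = dot / (ref_energy + eps)
    target = [alpha * r for r in ref]

    # Whatever is left after the projection is counted as distortion.
    residual = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)

    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


# Toy usage: a sine "clean" signal and two corrupted estimates.
clean = [math.sin(0.1 * n) for n in range(1000)]
interference = [math.cos(0.37 * n) for n in range(1000)]
mild = [c + 0.1 * i for c, i in zip(clean, interference)]
heavy = [c + 0.5 * i for c, i in zip(clean, interference)]
print(si_sdr(mild, clean), si_sdr(heavy, clean))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the enhanced signal leaves the score essentially unchanged, which is why the metric is commonly used to compare enhancement and separation systems whose output gain is arbitrary.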

Abstract

Personalized speech enhancement (PSE) methods typically rely on pre-trained speaker verification models or self-designed speaker encoders to extract target speaker clues, guiding the PSE model in isolating the desired speech. However, these approaches suffer from significant model complexity and often underutilize enrollment speaker information, limiting the potential performance of the PSE model. To address these limitations, we propose a novel Speaker Encoder-Free PSE network, termed SEF-PNet, which fully exploits the information present in both the enrollment speech and noisy mixtures. SEF-PNet incorporates two key innovations: Interactive Speaker Adaptation (ISA) and Local-Global Context Aggregation (LCA). ISA dynamically modulates the interactions between enrollment and noisy signals to enhance the speaker adaptation, while LCA employs advanced channel attention within the PSE encoder to effectively integrate local and global contextual information, thus improving feature learning. Experiments on the Libri2Mix dataset demonstrate that SEF-PNet significantly outperforms baseline models, achieving state-of-the-art PSE performance.

Paper Structure (14 sections, 7 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Overview of the proposed SEF-PNet model. All colored blocks highlight our key contributions over the original sDPCCN.
  • Figure 2: Structure of the Iterative Feature Integration (IFI) included in the ISA module. GA and LA represent Global Attention and Local Attention, respectively, as designed in the LCA module in Fig. 3.
  • Figure 3: Structure of the Local and Global Contexts Aggregation (LCA) module.
  • Figure 4: Spectrograms of (a) clean speech, (b) mixture speech, (c) sDPCCN and (d) SEF-PNet.