
A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2

Thomas Serre, Mathieu Fontaine, Éric Benhaim, Geoffroy Dutour, Slim Essid

TL;DR

This work addresses extracting a target speaker's voice in noisy, multi-speaker environments by personalizing a lightweight dual-stage speech enhancement framework (DeepFilterNet2) using a frozen ECAPA-TDNN speaker encoder. It systematically compares two embedding integration strategies (unified vs. dual encoder) and three personalization variants, using a composite loss that combines spectral, multi-resolution, and over-suppression terms. The results show that personalization improves performance over the non-personalized baseline, that the unified encoder provides the best balance of gains and computational efficiency, and that the lightweight pDeepFilterNet2 approaches state-of-the-art performance at a fraction of the parameter count and multiply-accumulate operations (MACs). This makes real-time personalized speech enhancement (PSE) feasible on embedded devices for applications such as calls in noisy environments and hearing assistance, while a gap remains to larger, more accurate models.

Abstract

Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, a SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performances of DeepFilterNet2 while preserving minimal computational overhead.

Paper Structure (16 sections, 5 equations, 2 figures, 2 tables)


Figures (2)

  • Figure 1: Personalized DeepFilterNet2 with a unified encoder (top) and with a dual encoder (bottom). E denotes the speaker embedding and C the concatenation operation.
  • Figure 2: Box plot of PESQ scores for DeepFilterNet2 and pDeepFilterNet2 (unified encoder version) under different noise types: primary speaker + noise (pn), primary speaker + secondary speaker (ps), primary speaker + secondary speaker + noise (psn).
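
As a rough illustration of the concatenation operation C named in the Figure 1 caption, the sketch below appends one fixed speaker embedding E to every time frame of a feature sequence. The function name, the toy dimensions, and the choice to repeat the embedding across all frames are assumptions made for illustration only; where in the network the concatenation actually happens is not specified here.

```python
from typing import List


def concat_speaker_embedding(
    frames: List[List[float]], embedding: List[float]
) -> List[List[float]]:
    """Append the same speaker embedding to every time frame.

    Hypothetical helper: `frames` is a (time x features) sequence and
    `embedding` is a single fixed vector for the target speaker.
    """
    return [frame + embedding for frame in frames]


# Toy sizes chosen for readability; they are assumptions, not the model's sizes.
num_frames, feat_dim = 4, 6
frames = [[0.1 * t] * feat_dim for t in range(num_frames)]
embedding = [1.0, -1.0, 0.5]  # stand-in for a frozen speaker-encoder output

conditioned = concat_speaker_embedding(frames, embedding)
print(len(conditioned), len(conditioned[0]))  # 4 9
```

Because the embedding is fixed for a given speaker, it only needs to be computed once per enrollment, and each frame's feature size grows by the embedding size.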