DFingerNet: Noise-Adaptive Speech Enhancement for Hearing Aids

Iosif Tsangko, Andreas Triantafyllopoulos, Michael Müller, Hendrik Schröter, Björn W. Schuller

TL;DR

This work tackles the challenge of adapting lightweight hearing-aid speech enhancement models to diverse acoustic environments. It introduces DFingerNet (DFiN), an extension of the DFN architecture that adds a dedicated fingerprint encoder to condition denoising on environment-specific noise fingerprints, with fusion strategies including additive and attention-based methods, while keeping the main model pretrained. The dataset strategy crops the first second of noise as fingerprints and mixes noise at random SNRs, evaluating on VCTK with DEMAND, FSD50k, and ESC-50. Results show notable gains in SI-SDR, PESQ, STOI, and DNSMOS, robust performance under fingerprint mismatches and distribution shifts, and practical feasibility due to off-device fingerprint processing and optional usage.
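The data strategy described above (crop the first second of a noise clip as the fingerprint, then mix noise into clean speech at a random SNR) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function name `make_example`, the 48 kHz default sample rate, the SNR range, and the choice to take the mixture noise from the part of the clip that follows the fingerprint are all assumptions.

```python
import numpy as np

def make_example(speech, noise, rng, sr=48000, fp_seconds=1.0, snr_range=(-5.0, 20.0)):
    """Return (noisy, fingerprint, snr_db) for one training example.

    Hypothetical helper: sr, fp_seconds and snr_range are illustrative defaults.
    """
    n_fp = int(sr * fp_seconds)
    if len(noise) < n_fp + len(speech):
        raise ValueError("noise clip too short for fingerprint plus mixture")

    # First second of the noise clip serves as the environment fingerprint.
    fingerprint = noise[:n_fp]
    # Assumption: the noise mixed into the speech follows the fingerprint segment.
    noise_seg = noise[n_fp:n_fp + len(speech)]

    # Draw a random SNR and scale the noise so the mixture reaches it.
    snr_db = rng.uniform(*snr_range)
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise_seg ** 2) + 1e-12
    gain = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))

    noisy = speech + gain * noise_seg
    return noisy, fingerprint, snr_db
```

At inference time only `noisy` and `fingerprint` would be passed to the model; `snr_db` is returned here purely for bookkeeping.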

Abstract

The DeepFilterNet (DFN) architecture was recently proposed as a deep learning model suited for hearing aid devices. Despite its competitive performance on numerous benchmarks, it still follows a "one-size-fits-all" approach, which aims to train a single, monolithic architecture that generalises across different noises and environments. However, its limited size and computation budget can hamper its generalisability. Recent work has shown that in-context adaptation can mitigate this limitation by conditioning the denoising process on additional information extracted from background recordings. The processing of these recordings can be offloaded outside the hearing aid, thus improving performance while adding minimal computational overhead. We introduce these principles to the DFN model, thus proposing the DFingerNet (DFiN) model, which shows superior performance on various benchmarks inspired by the DNS Challenge.
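To make the conditioning idea concrete, the sketch below shows two generic ways of fusing a fingerprint embedding into the embedding of the noisy signal: additive and attention-based, the two strategy families named in the TL;DR. It is an illustrative NumPy sketch, not the DFiN implementation. The function names, the mean pooling over fingerprint frames, and the residual connection in the attention variant are assumptions.

```python
import numpy as np

def additive_fusion(noisy_emb, fp_emb):
    """Add a time-pooled fingerprint embedding to every frame.

    noisy_emb: (T, D) frame embeddings of the noisy input.
    fp_emb:    (T_fp, D) frame embeddings of the noise fingerprint.
    """
    # Assumption: the fingerprint is summarised by mean pooling over its frames.
    fp_vec = fp_emb.mean(axis=0)
    # Broadcast the (D,) summary over all T frames.
    return noisy_emb + fp_vec[None, :]

def attention_fusion(noisy_emb, fp_emb):
    """Each noisy frame attends over the fingerprint frames (scaled dot-product)."""
    d = noisy_emb.shape[-1]
    scores = noisy_emb @ fp_emb.T / np.sqrt(d)        # (T, T_fp)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over fingerprint frames
    # Assumption: the attended context is added back as a residual.
    return noisy_emb + weights @ fp_emb
```

In both variants the output keeps the shape of `noisy_emb`, so a fusion step of this kind can sit between an encoder and its decoder without changing the decoder's input size, which is consistent with keeping the main model pretrained.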

Paper Structure (4 sections, 5 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Architecture of the DFiN: The model processes the noisy input and the noise fingerprint through separate encoders for ERB and complex features. After fusion (see the additive fusion equation), the features are decoded by respective ERB and DF decoders. The decoded ERB features are used to apply gains, while the noisy spectrum is filtered using MF filters. Combining gains with the filtered spectrum results in the enhanced spectrum estimate $E$, which is inverted to produce $e(t)$.
  • Figure 2: Stability analysis of the DFiN model on the VCTK-DEMAND dataset, showing $\Delta$SI-SDR as a function of the time difference (in seconds) between the noise mixture and noise fingerprints.
  • Figure 3: Evaluation on the high-level categories of the classes in the ESC dataset. The radar chart illustrates the $\Delta$SI-SDR performance boost across these high-level categories achieved by our model (DFiN) compared to the baseline (DFN).
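Figures 2 and 3 report $\Delta$SI-SDR, that is, the SI-SDR of the DFiN output minus that of the DFN baseline. The sketch below computes it with the standard scale-invariant SDR definition. The function names are hypothetical, and the zero-mean normalisation is a common convention that the paper's evaluation code may or may not apply.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    # Common convention (assumed here): remove the mean of both signals.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(residual, residual) + eps))

def delta_si_sdr(est_dfin, est_dfn, reference):
    """SI-SDR gain of the DFiN estimate over the DFN baseline estimate, in dB."""
    return si_sdr(est_dfin, reference) - si_sdr(est_dfn, reference)
```

A positive value means the fingerprint-conditioned output is closer to the clean reference than the baseline output, regardless of any overall gain difference between the two.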