Configurable EBEN: Extreme Bandwidth Extension Network to enhance body-conducted speech capture

Julien Hauret, Thomas Joubaud, Véronique Zimpfer, Éric Bavu

TL;DR

Configurable EBEN addresses bandwidth-limited speech captured by body-conduction microphones (BCMs). It introduces a multiband PQMF-based decomposition and a lightweight generator that operates on $P$ informative bands out of $M$ PQMF bands, coupled with band-focused discriminators for realism. The model is trained adversarially with a feature-matching reconstruction loss to extend bandwidth while denoising, and it runs in real time with a latency of about $\tau=4.3$ ms and a memory footprint of around $\delta=20$ MB. Experiments on synthetic BCM degradations, together with subjective MUSHRA testing, show that EBEN is competitive with or superior to existing baselines in quality and intelligibility, and that it can be tuned via $M$, $P$, and $Q$ to accommodate different BCM characteristics. The study also analyzes metric correlations, highlighting STOI and NORESQA-MOS as more reliable proxies for perception in this bandwidth-extension task, and discusses the limitations of simulated degradations, advocating future evaluation on real BCM datasets.

Abstract

This paper presents a configurable version of the Extreme Bandwidth Extension Network (EBEN), a Generative Adversarial Network (GAN) designed to improve audio captured with body-conduction microphones. We show that although these microphones significantly reduce environmental noise, this insensitivity to ambient noise comes at the expense of the bandwidth of the speech signal captured from the wearer. The captured signals therefore require signal enhancement techniques to recover full-bandwidth speech. EBEN leverages a configurable multiband decomposition of the raw captured signal. This decomposition reduces the time-domain dimension of the data and gives better control over the full-band signal. The multiband representation of the captured signal is processed by a U-Net-like model, which combines feature and adversarial losses to generate an enhanced speech signal. We also exploit this original representation in the proposed configurable discriminator architecture. The configurable EBEN approach can achieve state-of-the-art enhancement results on synthetic data with a lightweight generator that allows real-time processing.
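To make the effect of the multiband decomposition concrete, the following minimal sketch computes the band layout implied by a critically sampled, uniform $M$-band filter bank. It is an illustration under stated assumptions, not the authors' implementation: the `BandLayout` class, the 16 kHz sample rate, and the values $M=4$ and $P=1$ are hypothetical, and the sketch assumes the $P$ informative bands are the lowest ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BandLayout:
    """Bookkeeping for a uniform, critically sampled M-band decomposition.

    Hypothetical helper for illustration; all numeric values are assumptions.
    """
    sample_rate_hz: int  # assumed input sample rate (not stated in this section)
    m_bands: int         # M: total number of PQMF bands
    p_informative: int   # P: bands fed to the generator (assumed to be the lowest)

    def __post_init__(self) -> None:
        if not 1 <= self.p_informative <= self.m_bands:
            raise ValueError("P must satisfy 1 <= P <= M")

    @property
    def band_width_hz(self) -> float:
        # Each of the M uniform bands covers an equal share of [0, fs/2].
        return self.sample_rate_hz / (2 * self.m_bands)

    @property
    def subband_rate_hz(self) -> float:
        # Critical sampling: every subband is decimated by a factor of M.
        return self.sample_rate_hz / self.m_bands

    @property
    def informative_bandwidth_hz(self) -> float:
        # Upper frequency edge covered by the P lowest bands.
        return self.p_informative * self.band_width_hz

    def subband_length(self, n_samples: int) -> int:
        # Time-domain length of each subband signal (ignoring filter padding).
        return n_samples // self.m_bands


# Illustrative configuration only.
layout = BandLayout(sample_rate_hz=16_000, m_bands=4, p_informative=1)
print(layout.band_width_hz)             # width of one band in Hz
print(layout.subband_rate_hz)           # sample rate of each subband in Hz
print(layout.informative_bandwidth_hz)  # bandwidth seen by the generator in Hz
print(layout.subband_length(16_000))    # samples per subband for 1 s of audio
```

Under these assumptions, increasing $M$ shortens the time axis the generator must process, while $P$ sets how much of the low-frequency content is treated as informative input.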

Paper Structure

This paper contains 26 sections, 7 equations, 11 figures, and 3 tables.

Figures (11)

  • Figure 1: A/B testing results: in-ear vs traditional microphones. The p-values shown at the top of each bar indicate the significance of the preferred microphone.
  • Figure 2: Transfer function of the in-ear transducer
  • Figure 3: Coherence function of the in-ear transducer
  • Figure 4: Time-domain representation of speech signals captured in a quiet environment. Active speech is highlighted in green.
  • Figure 5: Block diagram of PQMF analysis and synthesis
  • ...and 6 more figures