PAS-SE: Personalized Auxiliary-Sensor Speech Enhancement for Voice Pickup in Hearables

Mattes Ohlenbusch, Mikolaj Kegler, Marko Stamenovic

TL;DR

This work systematically compares personalized speech enhancement (PSE) and auxiliary-sensor speech enhancement (AS-SE) for hearables and introduces PAS-SE, which combines enrollment-based personalization with auxiliary-sensor input. Using the Vibravox and Oldenburg datasets, it shows that training-time augmentations substantially improve the cross-dataset generalization of AS-SE. PAS-SE achieves the best overall performance, including in cross-dataset evaluation, which indicates that PSE and AS-SE provide complementary benefits. PAS-SE also remains effective when the enrollment speech is noisy, especially when the enrollment is recorded with the in-ear microphone. The findings suggest a practical path toward device-agnostic AS-SE systems in wearables, enabling more reliable own-voice extraction across diverse hardware and environments.

Abstract

Speech enhancement for voice pickup in hearables aims to improve the user's voice by suppressing noise and interfering talkers, while maintaining own-voice quality. For single-channel methods, it is particularly challenging to distinguish the target from interfering talkers without additional context. In this paper, we compare two strategies to resolve this ambiguity: personalized speech enhancement (PSE), which uses enrollment utterances to represent the target, and auxiliary-sensor speech enhancement (AS-SE), which uses in-ear microphones as additional input. We evaluate the strategies on two public datasets, employing different auxiliary sensor arrays, to investigate their cross-dataset generalization. We propose training-time augmentations to facilitate cross-dataset generalization of AS-SE systems. We also show that combining PSE and AS-SE (PAS-SE) provides complementary performance benefits, especially when enrollment speech is recorded with the in-ear microphone. We further demonstrate that PAS-SE personalized with noisy in-ear enrollments maintains performance benefits over the AS-SE system.

Paper Structure

This paper contains 12 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: PAS-SE system architecture based on FT-JNF [tesch_insights_2023]. The system is personalized using multiplicative conditioning with a feature vector $\mathbf{e}$ obtained from an enrollment utterance $\tilde{Y}_\text{enroll}$.
  • Figure 2: Cross-dataset interferer reduction performance (V) achieved by SE, PSE, AS-SE, and PAS-SE systems at different enrollment utterance SNRs ($-\infty$: only noise, no speech, $\infty$: clean speech).
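
The Figure 1 caption mentions multiplicative conditioning with an enrollment feature vector $\mathbf{e}$. A minimal sketch of that general idea follows, using NumPy. The caption does not say where in the network the conditioning is applied or how $\mathbf{e}$ is mapped to the feature dimension, so the linear projection (`W`, `b`), the array shapes, and the function name are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def multiplicative_conditioning(features, e, W, b):
    """Scale intermediate features by gains derived from an enrollment embedding.

    features: (T, D) array of per-frame features (hypothetical shape)
    e:        (E,) enrollment feature vector
    W, b:     (E, D) and (D,) parameters of an assumed linear projection
    """
    gains = e @ W + b        # project the embedding to one gain per feature channel
    return features * gains  # broadcast the same gains over all T frames

# Illustrative usage with random values (all sizes are arbitrary).
rng = np.random.default_rng(0)
features = rng.standard_normal((5, 8))  # 5 frames, 8 channels
e = rng.standard_normal(4)              # 4-dimensional enrollment embedding
W = rng.standard_normal((4, 8))
b = np.ones(8)
conditioned = multiplicative_conditioning(features, e, W, b)
print(conditioned.shape)
```

Because the gains depend only on the enrollment embedding, the same per-channel scaling is applied to every frame; in a trained system the projection would be learned jointly with the enhancement network.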