PAS-SE: Personalized Auxiliary-Sensor Speech Enhancement for Voice Pickup in Hearables
Mattes Ohlenbusch, Mikolaj Kegler, Marko Stamenovic
TL;DR
This work systematically compares personalized speech enhancement (PSE) and auxiliary-sensor speech enhancement (AS-SE) for hearables and introduces PAS-SE, which combines enrollment-based personalization with auxiliary-sensor input. Experiments on the Vibravox and Oldenburg datasets show that training-time augmentations substantially improve the cross-dataset generalization of AS-SE. PSE and AS-SE prove complementary, especially when enrollment speech is recorded with the in-ear microphone: PAS-SE achieves the best overall performance, generalizes across datasets, and keeps its advantage over AS-SE even when the in-ear enrollment is noisy. The findings suggest a practical path toward device-agnostic AS-SE systems in wearables, enabling more reliable own-voice extraction across diverse hardware and environments.
Abstract
Speech enhancement for voice pickup in hearables aims to improve the user's voice by suppressing noise and interfering talkers, while maintaining own-voice quality. For single-channel methods, it is particularly challenging to distinguish the target from interfering talkers without additional context. In this paper, we compare two strategies to resolve this ambiguity: personalized speech enhancement (PSE), which uses enrollment utterances to represent the target, and auxiliary-sensor speech enhancement (AS-SE), which uses in-ear microphones as additional input. We evaluate the strategies on two public datasets, employing different auxiliary sensor arrays, to investigate their cross-dataset generalization. We propose training-time augmentations to facilitate cross-dataset generalization of AS-SE systems. We also show that combining PSE and AS-SE (PAS-SE) provides complementary performance benefits, especially when enrollment speech is recorded with the in-ear microphone. We further demonstrate that PAS-SE personalized with noisy in-ear enrollments maintains performance benefits over the AS-SE system.
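To make the difference between the strategies concrete, the following minimal sketch shows which inputs each one would hand to an enhancement model. It is only an illustration under assumptions: the names `build_inputs` and `strategy_name`, the channel-stacking layout, and the fixed-size enrollment embedding are hypothetical and are not taken from the paper, which may condition its models differently.

```python
import numpy as np


def build_inputs(outer_mic, in_ear_mic=None, enrollment_embedding=None):
    """Assemble model inputs for one utterance (hypothetical layout).

    outer_mic: 1-D array, noisy outer-microphone signal.
    in_ear_mic: optional 1-D array of the same length (auxiliary sensor).
    enrollment_embedding: optional 1-D speaker vector derived from
        enrollment speech (how it is computed is out of scope here).
    """
    channels = [np.asarray(outer_mic, dtype=np.float32)]
    if in_ear_mic is not None:
        in_ear_mic = np.asarray(in_ear_mic, dtype=np.float32)
        if in_ear_mic.shape != channels[0].shape:
            raise ValueError("in-ear and outer signals must have equal length")
        channels.append(in_ear_mic)
    # Assumption: the auxiliary sensor is passed as an extra input channel.
    signal = np.stack(channels, axis=0)  # shape: (channels, samples)
    return {"signal": signal, "speaker": enrollment_embedding}


def strategy_name(inputs):
    """Name the strategy implied by the available inputs."""
    has_aux = inputs["signal"].shape[0] > 1
    has_enrollment = inputs["speaker"] is not None
    if has_aux and has_enrollment:
        return "PAS-SE"  # auxiliary sensor plus enrollment
    if has_aux:
        return "AS-SE"   # auxiliary sensor only
    if has_enrollment:
        return "PSE"     # enrollment only
    return "SE"          # plain single-channel enhancement
```

In this layout, PAS-SE is simply the case where both kinds of context are present, which matches the abstract's description of it as the combination of PSE and AS-SE.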
