End-to-End Multi-Microphone Speaker Extraction Using Relative Transfer Functions
Aviad Eisenberg, Sharon Gannot, Shlomo E. Chazan
TL;DR
The paper tackles end-to-end target speaker extraction (TSE) from multi-microphone mixtures in reverberant environments by leveraging instantaneous relative transfer function (RTF) features derived from an enrollment utterance recorded at the desired source location. It compares three enrollment cues, namely RTF, direction of arrival (DOA), and spectral embeddings, and shows through extensive simulations that RTF-based enrollment yields superior separation and robustness, often outperforming minimum variance distortionless response (MVDR) baselines, even when the competing speakers share the same DOA. The approach combines a multi-channel encoder-decoder architecture with per-feature enrollment encoders and a frame-wise fusion mechanism, trained with a time-domain scale-invariant signal-to-distortion ratio (SI-SDR) loss and a data augmentation scheme. The work advances practical TSE by highlighting the value of the spatial cues captured by the RTF in reverberant and directional-noise scenarios, with implications for hearing aids, virtual assistants, and robust speech systems.
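For readers unfamiliar with the SI-SDR training objective mentioned above, the following is a minimal NumPy sketch of the standard SI-SDR definition and its negation as a loss. The function names `si_sdr` and `si_sdr_loss`, the zero-mean step, and the `eps` regularizer are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D time-domain signals.

    Hypothetical helper: the zero-mean step and eps value are assumptions.
    """
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to obtain the scaled target component.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    # Whatever is left after removing the projection counts as distortion.
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


def si_sdr_loss(estimate, target):
    """Training loss: minimizing it maximizes SI-SDR."""
    return -si_sdr(estimate, target)


# Toy signals (synthetic, for illustration only).
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
clean_ish = target + 0.1 * noise
noisy = target + 1.0 * noise
```

Because the estimate is projected onto the target before the ratio is formed, rescaling the estimate leaves the score unchanged, which is why the measure is called scale-invariant.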
Abstract
This paper introduces a multi-microphone method for extracting a desired speaker from a mixture involving multiple speakers and directional noise in a reverberant environment. We propose leveraging the instantaneous relative transfer function (RTF), estimated from a reference utterance recorded at the same position as the desired source. The effectiveness of the RTF-based spatial cue is compared with that of a direction of arrival (DOA)-based spatial cue and a conventional spectral embedding. Experimental results in challenging acoustic scenarios demonstrate that spatial cues yield better performance than the spectral-based cue, and that the instantaneous RTF outperforms the DOA-based spatial cue.
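To make the notion of an instantaneous RTF concrete, here is a minimal NumPy sketch that forms a per-frame, per-frequency ratio between each microphone's STFT and that of a reference microphone. This regularized ratio is one common way to define an instantaneous RTF; whether the paper uses exactly this estimator is an assumption, and the helper names (`stft`, `instantaneous_rtf`), frame sizes, and `eps` are hypothetical.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Simple Hann-windowed STFT; returns an array of shape (frames, freqs)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def instantaneous_rtf(mics, ref=0, n_fft=512, hop=256, eps=1e-8):
    """Per-bin ratio X_m(t, f) / X_ref(t, f) for each microphone m.

    mics: array of shape (num_mics, num_samples).
    Returns a complex array of shape (num_mics, frames, freqs).
    The ratio is regularized as X_m * conj(X_ref) / (|X_ref|^2 + eps)
    to avoid division by near-zero reference bins (an assumption).
    """
    spectra = np.stack([stft(ch, n_fft, hop) for ch in mics])
    ref_spec = spectra[ref]
    return spectra * np.conj(ref_spec) / (np.abs(ref_spec) ** 2 + eps)


# Toy enrollment recording: the second microphone is a half-gain copy of the
# first. Real reverberant RTFs are frequency dependent; this is illustration only.
rng = np.random.default_rng(0)
enrollment = rng.standard_normal(4000)
mics = np.stack([enrollment, 0.5 * enrollment])
rtf = instantaneous_rtf(mics)
```

In this toy setup the RTF of the second channel is simply the gain between the two microphones at every bin, whereas in a reverberant room it would vary across frequency and encode the acoustic path from the enrollment position.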
