Interpretable Binaural Deep Beamforming Guided by Time-Varying Relative Transfer Function

Ilai Zaidel, Sharon Gannot

TL;DR

Results show that RTF guidance yields smoother, more spatially consistent beampatterns that track the target direction of arrival (DOA), whereas the unguided model fails to maintain a clear spatial focus.

Abstract

In this work, we propose a deep beamforming framework for speech enhancement in dynamic acoustic environments. The framework learns time-varying beamformer weights from noisy multichannel signals via a deep neural network, guided by a continuously tracked relative transfer function (RTF) of a moving target speaker. We analyze the network's spatial behavior on an 8-microphone linear array by evaluating narrowband and wideband beampatterns in three modes: (i) oracle guidance with true RTFs, (ii) guidance with subspace-tracked RTF estimates, and (iii) operation without RTF guidance. Results show that RTF guidance yields smoother, more spatially consistent beampatterns that track the target direction of arrival (DOA), whereas the unguided model fails to maintain a clear spatial focus. We further extend the framework to binaural beamforming for dynamic target-speaker enhancement. The system is trained using a head-related transfer function (HRTF)-based acoustic simulation of a moving source, enabling realistic spatial rendering at the left and right ears. Spatial cue preservation is quantitatively evaluated in terms of interaural level differences (ILD) and interaural time differences (ITD), demonstrating the method's suitability for hearable applications.
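As a rough illustration of what evaluating a narrowband beampattern involves, the sketch below computes the spatial response $|\mathbf{w}^{H}\mathbf{a}(\theta)|^{2}$ of a weight vector on an 8-microphone uniform linear array. It is not the authors' code. The far-field, free-field steering model, the 4 cm spacing, the 2 kHz frequency, and the delay-and-sum weights standing in for the network output are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np

def steering_vector(theta_rad, freq_hz, n_mics=8, spacing_m=0.04, c=343.0):
    """Far-field, free-field steering vector of a uniform linear array.

    theta_rad is measured from the array axis (0 = endfire, pi/2 = broadside).
    The spacing and speed of sound are assumed values, not taken from the paper.
    """
    mic_index = np.arange(n_mics)
    delays = mic_index * spacing_m * np.cos(theta_rad) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def narrowband_beampattern_db(weights, freq_hz, thetas_rad, spacing_m=0.04, c=343.0):
    """Return 10*log10(|w^H a(theta)|^2) for every candidate angle."""
    n_mics = weights.shape[0]
    power = np.array([
        # np.vdot conjugates its first argument, giving w^H a(theta).
        np.abs(np.vdot(weights, steering_vector(t, freq_hz, n_mics, spacing_m, c))) ** 2
        for t in thetas_rad
    ])
    return 10.0 * np.log10(np.maximum(power, 1e-12))

# Stand-in weights: a delay-and-sum beamformer steered to 60 degrees.
# In the paper's setting these would be the time-varying weights from the network.
thetas_deg = np.arange(0, 181)
thetas_rad = np.deg2rad(thetas_deg)
freq_hz = 2000.0
weights = steering_vector(np.deg2rad(60.0), freq_hz) / 8

pattern_db = narrowband_beampattern_db(weights, freq_hz, thetas_rad)
peak_deg = int(thetas_deg[np.argmax(pattern_db)])
print(f"Beampattern peak at {peak_deg} degrees")
```

Repeating this calculation for the weights of successive time frames gives a time-varying beampattern, and combining the patterns over frequency bins gives a wideband view.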

Paper Structure

This paper contains 17 sections, 8 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Overview of the proposed dual-branch beamforming network.
  • Figure 2: ILD/ITD graphs comparing the three model variants. Graphs produced by [faller2004source] (center frequency $f_{\rm c}=500~\mathrm{Hz}$).
  • Figure 3: Narrowband time-varying beampattern at four time snapshots, using RTFs estimated by PAST.
  • Figure 4: Narrowband time-varying beampattern at four time snapshots, with no RTF guidance.
  • Figure 5: Wideband time-varying beampattern (dB) at four time snapshots, using RTFs estimated by PAST.
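For readers unfamiliar with the binaural cues shown in Figure 2, the sketch below shows one simple way to estimate ILD and ITD from a pair of ear signals. It uses a broadband energy ratio and a cross-correlation peak search. This is a simplification and not the procedure behind the figure, which relies on the cited method at a 500 Hz center frequency. The function names, the 16 kHz sampling rate, the 1 ms lag limit, and the synthetic test signal are all assumptions.

```python
import numpy as np

def ild_db(left, right, eps=1e-12):
    """Interaural level difference in dB (positive when the left ear is louder)."""
    return 10.0 * np.log10((np.sum(left ** 2) + eps) / (np.sum(right ** 2) + eps))

def itd_seconds(left, right, fs, max_itd_s=1e-3):
    """Interaural time difference from the cross-correlation peak.

    Positive values mean the left-ear signal lags the right-ear signal.
    """
    max_lag = int(round(max_itd_s * fs))
    xcorr = np.correlate(left, right, mode="full")
    # In "full" mode the lags run from -(len(right) - 1) to len(left) - 1.
    lags = np.arange(-(len(right) - 1), len(left))
    keep = np.abs(lags) <= max_lag
    best_lag = lags[keep][np.argmax(xcorr[keep])]
    return best_lag / fs

# Synthetic check: the left ear receives a half-amplitude copy delayed by 8 samples.
fs = 16000
rng = np.random.default_rng(0)
right = rng.standard_normal(1600)
left = 0.5 * np.concatenate([np.zeros(8), right[:-8]])

ild = ild_db(left, right)
itd = itd_seconds(left, right, fs)
print(f"ILD = {ild:.2f} dB, ITD = {itd * 1e3:.2f} ms")
```

Comparing these cues between the enhanced output and the clean binaural target is one way to check whether a beamformer preserves the spatial impression of the source.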