Binaural Target Speaker Extraction using Individualized HRTF

Yoav Ellinson, Sharon Gannot

TL;DR

This work proposes a novel approach that leverages the individual listener's Head-Related Transfer Function (HRTF) to isolate the target speaker. It employs a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals.

Abstract

In this work, we address the problem of binaural target-speaker extraction in the presence of multiple simultaneous talkers. We propose a novel approach that leverages the individual listener's Head-Related Transfer Function (HRTF) to isolate the target speaker. The proposed method is speaker-independent, as it does not rely on speaker embeddings. We employ a fully complex-valued neural network that operates directly on the complex-valued Short-Time Fourier Transform (STFT) of the mixed audio signals, and compare it to a Real-Imaginary (RI)-based neural network, demonstrating the advantages of the former. We first evaluate the method in an anechoic, noise-free scenario, achieving excellent extraction performance while preserving the binaural cues of the target signal. We then extend the evaluation to reverberant conditions. Our method proves robust, maintaining speech clarity and source directionality while simultaneously reducing reverberation. A comparative analysis with existing binaural Target Speaker Extraction (TSE) methods shows that the proposed approach achieves performance comparable to state-of-the-art techniques in terms of noise reduction and perceptual quality, while providing a clear advantage in preserving binaural cues. Demo page: https://bi-ctse-hrtf.github.io
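To make the distinction between a complex-valued input and a Real-Imaginary (RI) input concrete, the following is a minimal NumPy sketch, not the authors' implementation. The sampling rate, frame length, hop size, the helper `stft`, and the random placeholder signal are all assumptions introduced for illustration only.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Complex STFT of a 1-D signal (hypothetical helper).

    Returns an array of shape (frames, n_fft // 2 + 1).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

rng = np.random.default_rng(0)
fs = 16000  # assumed sampling rate, not taken from the paper
binaural = rng.standard_normal((2, fs))  # placeholder left/right mixture

# Complex-valued representation: one complex channel per ear.
X = np.stack([stft(ch) for ch in binaural])  # shape (2, frames, bins)

# RI representation: real and imaginary parts stacked as real channels.
X_ri = np.concatenate([X.real, X.imag], axis=0)  # shape (4, frames, bins)

# A single complex weight couples the real and imaginary parts through
# complex multiplication, which a complex-valued layer applies natively.
w = rng.standard_normal() + 1j * rng.standard_normal()
Y = w * X

# The same operation written out in real arithmetic on the RI parts.
Y_re = w.real * X.real - w.imag * X.imag
Y_im = w.real * X.imag + w.imag * X.real
```

The last lines show that a complex multiplication ties the two real-valued parts together with shared weights, whereas a generic real-valued layer acting on `X_ri` is free to learn unrelated weights for each part.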

Paper Structure

This paper contains 12 sections, 6 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A block diagram of the proposed method, where $\bm{h}_{\text{hrtf}}{(\theta_d,\phi_d,k)}$ denotes the HRTF of the desired speaker’s DOA, $\bm{x}_b$ represents the mixed binaural signal, and $\hat{\tilde{\bm{s}}}_d$ represents the estimated desired signal, both in the STFT domain.
  • Figure 2: An illustration of the simulated scene with two concurrent speakers: the desired speaker at $\theta_d = 40^\circ$ (left) and the interferer at $\theta_i = -30^\circ$ (right), both at elevation $\phi$. Image source: https://www.freepik.com
  • Figure 3: The joint p.d.f. of ITD [ms] and ILD [dB] for $\theta_d = -60^\circ$, $\theta_i = -90^\circ$, and $\phi_{d,i} = -10^\circ$, with $T_{60}=0.63$ s, for the frequency band centered at 500 Hz, where the ILD is less pronounced. $\tilde{\bm{s}}_d$ and $\tilde{\bm{s}}_i$ denote the anechoic target signals. Graphs produced by faller2004source.
  • Figure 4: A polar plot of SI-SDR [dB] versus DOA, with the target at $\theta_d=-75^\circ$ (red) and the interferer at $\theta_i=-10^\circ$ (dashed blue). Stars mark true DOAs;
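Figure 4 reports SI-SDR in dB. As a reminder of how that metric is commonly defined, here is a minimal single-channel sketch based on the standard scale-invariant formulation. The function name `si_sdr`, the `eps` regularizer, and the synthetic test signals are assumptions, and the paper's exact evaluation protocol (for example, how the two ears are combined) is not specified in this section.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D signals (standard definition)."""
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)

rng = np.random.default_rng(0)
reference = rng.standard_normal(16000)                    # synthetic clean signal
estimate = reference + 0.1 * rng.standard_normal(16000)   # synthetic noisy estimate

score = si_sdr(estimate, reference)
score_rescaled = si_sdr(2.0 * estimate, reference)  # gain change on the estimate
```

Because the target is obtained by projection, rescaling the estimate leaves the score essentially unchanged, which is the property that makes the metric insensitive to output gain.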