Binaural sound source localization using a hybrid time and frequency domain model

Gil Geva, Olivier Warusfel, Shlomo Dubnov, Tammuz Dubnov, Amir Amedi, Yacov Hel-Or

TL;DR

This work tackles full-sphere binaural sound source localization using a compact two-microphone setup by leveraging head-related transfer function cues with a deep-learning, end-to-end hybrid model that fuses time-domain waveforms and time-frequency spectrograms. A data pipeline built from KU100 dummy-head recordings and MUSDB18-derived HRIRs generates rich binaural datasets across 24 directions, enabling accurate spatial estimation via a three-branch network that merges waveform and spectrogram features. The authors report a dramatic improvement over prior work, achieving an average angular error of $0.24^\circ$ and an average Euclidean distance of $0.01$ m, versus the benchmark's $19.07^\circ$ and $1.08$ m, and show the approach is robust across directions and frequencies. They also propose future directions for head-agnostic localization, joint localization and denoising, and low-cost pinna-inspired designs, highlighting significant implications for robotics, VR, and CI applications.
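The fusion of time-domain waveforms and time-frequency spectrograms can be illustrated by how the two input views might be prepared from one binaural clip. The following is a minimal sketch, not the authors' pipeline: the function names `magnitude_spectrogram` and `hybrid_views`, the Hann window, the frame length of 64, and the hop of 32 are all assumptions chosen for illustration, and the toy signal stands in for real KU100 recordings.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Naive STFT magnitude: Hann-windowed frames with a direct DFT per frame.

    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        # Keep only the non-negative frequency bins (0 .. frame_len / 2).
        spectrum = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


def hybrid_views(left, right):
    """Return (waveform view, spectrogram view) for a two-channel clip."""
    waveform_view = [list(left), list(right)]            # 2 x samples
    spectrogram_view = [magnitude_spectrogram(left),
                        magnitude_spectrogram(right)]    # 2 x frames x bins
    return waveform_view, spectrogram_view


# Toy binaural clip: a tone centred on DFT bin 8, with the right channel
# delayed by 4 samples and attenuated (a crude stand-in for HRTF cues).
n_samples = 256
left = [math.sin(2 * math.pi * 8 * n / 64) for n in range(n_samples)]
right = [0.0] * 4 + [0.5 * x for x in left[:-4]]

wave, spec = hybrid_views(left, right)
peak_bin = max(range(len(spec[0][0])), key=lambda k: spec[0][0][k])
```

In a hybrid model, the waveform view would feed the time-domain branch and the spectrogram view the frequency-domain branch before their features are merged. The interaural delay is visible directly in the waveform view, and the level difference shows up in both views.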

Abstract

This paper introduces a new approach to sound source localization using head-related transfer function (HRTF) characteristics, which enable precise full-sphere localization from raw data. Previous research focused primarily on extensive microphone arrays in the frontal plane, and its accuracy and robustness often degraded when smaller microphone arrays were used. Our model uses both the time and frequency domains for sound source localization within a deep learning (DL) approach. The proposed model surpasses the current state-of-the-art results: it achieves an average angular error of $0.24^\circ$ and an average Euclidean distance of $0.01$ m, while the known state of the art gives an average angular error of $19.07^\circ$ and an average Euclidean distance of $1.08$ m. This level of accuracy is important for a wide range of applications, including robotics, virtual reality, and aiding individuals with cochlear implants (CI).
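The two reported metrics, average angular error and average Euclidean distance, can be sketched as follows. This section does not give the paper's exact definitions, so the code assumes the common ones: the angle between the predicted and true direction vectors, and the straight-line distance between predicted and true source positions. The helper names and the azimuth/elevation convention (azimuth in the horizontal plane, elevation measured up from it) are also assumptions.

```python
import math


def sph_to_cart(azimuth_deg, elevation_deg, radius=1.0):
    """Convert azimuth/elevation in degrees to Cartesian (x, y, z).

    Assumed convention: azimuth in the x-y plane, elevation above it.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (radius * math.cos(el) * math.cos(az),
            radius * math.cos(el) * math.sin(az),
            radius * math.sin(el))


def angular_error_deg(p, q):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.hypot(*p) * math.hypot(*q)
    # Clamp to [-1, 1] to guard against rounding just outside acos's domain.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def euclidean_distance(p, q):
    """Straight-line distance between two 3D points (same unit as inputs)."""
    return math.dist(p, q)


# Hypothetical (true, predicted) pairs given as (azimuth, elevation) in degrees.
pairs = [((0.0, 0.0), (90.0, 0.0)), ((45.0, 30.0), (45.0, 30.0))]
angles = [angular_error_deg(sph_to_cart(*t), sph_to_cart(*p)) for t, p in pairs]
dists = [euclidean_distance(sph_to_cart(*t), sph_to_cart(*p)) for t, p in pairs]
mean_angle = sum(angles) / len(angles)
mean_dist = sum(dists) / len(dists)
```

Averaging each metric over all test clips gives summary figures of the kind quoted above. With a unit radius, the distance is in units of that radius rather than meters.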

Paper Structure (13 sections, 1 equation, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Recording studio and KU100 Dummy head
  • Figure 2: Hybrid time and frequency domain model architecture
  • Figure 3: Mean angular error in degrees for each speaker. The model's average angular error is $0.24^\circ$, while the benchmark's average angular error is $19.07^\circ$.
  • Figure 4: Mean angular error in degrees for each frequency range in kHz. There is a high outlier of $2.7^\circ$ at 3 kHz, while the other frequencies are below or around $0.5^\circ$.
  • Figure 5: The left graph shows the mean angular error for each speaker. The right graph presents an example of a source location (blue line) and the model's prediction (orange line).
  • ...and 1 more figures