Binaural Angular Separation Network

Yang Yang, George Sung, Shao-Fu Shih, Hakan Erdogan, Chehung Lee, Matthias Grundmann

TL;DR

Binaural Angular Separation Network (BASNet) addresses directional speech separation using two microphones by exploiting fixed angular regions and consistent TDOA cues. It trains an end-to-end convolutional U-Net on simulated room impulse responses generated via the image method to synthesize multi-channel inputs, leveraging delay (IPD/TDOA) information for separation. BASNet demonstrates on-device real-time performance and generalizes across devices with different microphone geometries, outperforming prior neural beamforming approaches and enabling steerable directivity. The method enables robust, low-latency enhancement suitable for telephony and video conferencing, with the ability to focus on targeted spatial regions through controllable input latency.

Abstract

We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones. The model is trained with simulated room impulse responses (RIRs) using omni-directional microphones without needing to collect real RIRs. By relying on specific angular regions and multiple room simulations, the model utilizes consistent time difference of arrival (TDOA) cues, or what we call delay contrast, to separate target and interference sources while remaining robust in various reverberation environments. We demonstrate the model is not only generalizable to a commercially available device with a slightly different microphone geometry, but also outperforms our previous work which uses one additional microphone on the same device. The model runs in real-time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing.

Paper Structure

This paper contains 12 sections, 2 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: RIR simulation setup for target and interference sources. Target signal sources are confined to $[-\theta, +\theta]$ and $[180^{\circ}-\theta, 180^{\circ}+\theta]$; interference sources are confined to $[90^{\circ}-\phi, 180^{\circ}-\phi]$ and $[180^{\circ}+\phi, 360^{\circ}-\phi]$. Noise sources can come from any direction over the full $360^{\circ}$. The distance between the two microphones is denoted as $d$.
  • Figure 2: Listening room setup.
  • Figure 3: Directivity pattern with different sample offsets.
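
The abstract's "delay contrast" and the angular regions of Figure 1 can be illustrated with a small far-field calculation. The sketch below is not taken from the paper: it assumes a plane-wave (far-field) source, angles measured from the broadside of the two-microphone axis so that $0^{\circ}$ and $180^{\circ}$ both give zero delay, a speed of sound of 343 m/s, and hypothetical values for the microphone distance $d$ and the sample rate. Under those assumptions, sources in the two target regions share a small absolute TDOA, while sources toward the sides have a larger one.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value, not stated in the source


def tdoa_samples(angle_deg, mic_distance_m, sample_rate_hz=16000):
    """Far-field TDOA between two microphones, in samples.

    Assumes the angle is measured from broadside (perpendicular to the
    microphone axis), so the delay is d * sin(angle) / c.
    """
    tau_seconds = mic_distance_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return tau_seconds * sample_rate_hz


def in_target_region(angle_deg, theta_deg):
    """True if the angle lies in [-theta, +theta] or [180-theta, 180+theta]."""
    a = angle_deg % 360.0
    near_front = a <= theta_deg or a >= 360.0 - theta_deg
    near_back = abs(a - 180.0) <= theta_deg
    return near_front or near_back


if __name__ == "__main__":
    d = 0.10      # hypothetical microphone distance in meters
    theta = 30.0  # hypothetical half-width of the target region in degrees
    for angle in (0, 20, 90, 160, 180, 270):
        print(angle, in_target_region(angle, theta), round(tdoa_samples(angle, d), 3))
```

With two microphones, a source at angle $a$ and one at $180^{\circ}-a$ produce the same delay under this model, which is consistent with Figure 1 defining the target as a front region plus its mirrored back region.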