Binaural Angular Separation Network
Yang Yang, George Sung, Shao-Fu Shih, Hakan Erdogan, Chehung Lee, Matthias Grundmann
TL;DR
Binaural Angular Separation Network (BASNet) addresses directional speech separation with two microphones by exploiting fixed angular regions and consistent time difference of arrival (TDOA) cues. It trains an end-to-end convolutional U-Net on multi-channel inputs synthesized from room impulse responses simulated with the image method, and uses delay information (IPD/TDOA) to separate sources. BASNet runs in real time on-device and generalizes to a device with a different microphone geometry, outperforming prior neural beamforming approaches and enabling steerable directivity. The method provides robust, low-latency enhancement suitable for telephony and video conferencing, and it can focus on targeted spatial regions through controllable input latency.
Abstract
We propose a neural network model that can separate target speech sources from interfering sources at different angular regions using two microphones. The model is trained with simulated room impulse responses (RIRs) using omni-directional microphones, without needing to collect real RIRs. By relying on specific angular regions and multiple room simulations, the model utilizes consistent time difference of arrival (TDOA) cues, or what we call delay contrast, to separate target and interference sources while remaining robust in various reverberation environments. We demonstrate that the model not only generalizes to a commercially available device with a slightly different microphone geometry, but also outperforms our previous work, which uses one additional microphone on the same device. The model runs in real time on-device and is suitable for low-latency streaming applications such as telephony and video conferencing.
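To make the TDOA cue concrete, the following is a minimal sketch of how the delay between two microphones depends on the source angle. It assumes a far-field source in free space (no reverberation), which is a textbook simplification and not the simulated-RIR pipeline described above. The function names, the 0.1 m microphone spacing, and the 16 kHz sample rate are hypothetical values chosen for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def far_field_tdoa(angle_deg, mic_spacing_m, c=SPEED_OF_SOUND):
    """Return the TDOA in seconds between two microphones.

    Assumes a far-field (plane wave) source in free space. The angle is
    measured from broadside: 0 degrees is directly in front of the array,
    +/-90 degrees lies along the microphone axis.
    """
    return mic_spacing_m * math.sin(math.radians(angle_deg)) / c


def tdoa_in_samples(angle_deg, mic_spacing_m, sample_rate_hz, c=SPEED_OF_SOUND):
    """Return the same TDOA expressed in (fractional) samples."""
    return far_field_tdoa(angle_deg, mic_spacing_m, c) * sample_rate_hz


def in_target_region(angle_deg, half_width_deg):
    """Return True if the source angle lies in a symmetric frontal region."""
    return abs(angle_deg) <= half_width_deg


if __name__ == "__main__":
    spacing = 0.1         # hypothetical microphone spacing in meters
    sample_rate = 16000   # hypothetical sample rate in Hz
    for angle in (-90, -30, 0, 30, 90):
        delay = tdoa_in_samples(angle, spacing, sample_rate)
        label = "target" if in_target_region(angle, 30) else "interference"
        print(f"{angle:+4d} deg  {delay:+.2f} samples  {label}")
```

Under these assumptions, sources inside a frontal region produce small delays and sources toward the sides produce larger ones, so each angular region maps to a distinct delay range. That mapping is the kind of consistent contrast a separation model can learn to exploit; reverberation blurs it in practice, which is why training across many simulated rooms matters.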
