All Neural Low-latency Directional Speech Extraction
Ashutosh Pandey, Sanha Lee, Juan Azcarreta, Daniel Wong, Buye Xu
TL;DR
This paper tackles low-latency directional speech extraction with a fully neural framework: trainable direction-of-arrival (DOA) embeddings, drawn from a predefined spatial grid, condition an all-neural, time-domain speech extractor called the Directional Recurrent Network (DRN). The method integrates two forms of DOA embeddings, channel-wise and frame-wise, into the spatial and temporal processing via a fusion scheme, and is optimized end to end with a speech enhancement objective at millisecond-scale latency. Key findings show that DOA embeddings, especially azimuth-elevation embeddings, improve performance and robustness to DOA mismatch, and that the model adapts rapidly to abrupt DOA switches in dynamic scenes; the approach also outperforms several strong baselines under low-latency constraints. The work advances practical directional speech extraction for multichannel setups, enabling fast, robust performance in scenarios such as augmented reality and wearable devices, and points to future work on training with moving sources and handling more dynamic motion.
Abstract
We introduce a novel all-neural model for low-latency directional speech extraction. The model uses direction-of-arrival (DOA) embeddings from a predefined spatial grid, which are transformed and fused into a recurrent-neural-network-based speech extraction model. This process enables the model to effectively extract speech from a specified DOA. Unlike previous methods that relied on hand-crafted directional features, the proposed model trains DOA embeddings from scratch using a speech enhancement loss, making it suitable for low-latency scenarios. Additionally, it operates at a high frame rate and takes in a DOA with each input frame, which lets it adapt quickly to changing scenes in highly dynamic real-world scenarios. We provide an extensive evaluation to demonstrate the model's efficacy in directional speech extraction, its robustness to DOA mismatch, and its capability to adapt quickly to abrupt changes in DOA.
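As a rough illustration of the per-frame DOA lookup described above, the following minimal sketch maps a DOA to the nearest point on a predefined grid and retrieves one embedding per input frame. It is not the paper's implementation: the azimuth-only grid, the 5-degree step, the embedding size, and every function name are assumptions made for illustration, and the table is randomly initialised where the paper's embeddings are learned with the speech enhancement loss. The fusion into the recurrent extractor is not shown.

```python
import random

GRID_STEP_DEG = 5  # hypothetical azimuth grid resolution (assumption)
EMBED_DIM = 8      # hypothetical embedding size (assumption)


def azimuth_to_index(azimuth_deg, step=GRID_STEP_DEG):
    """Map an azimuth in degrees to the index of the nearest grid point."""
    n_points = 360 // step
    # Wrap into [0, 360), snap to the grid, and wrap the index at 360 degrees.
    return round((azimuth_deg % 360) / step) % n_points


def make_embedding_table(n_points, dim, seed=0):
    """Build a randomly initialised table; a trained model would learn these."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]


def frame_wise_embeddings(table, azimuths_deg, step=GRID_STEP_DEG):
    """Look up one embedding per input frame from per-frame DOA values."""
    return [table[azimuth_to_index(a, step)] for a in azimuths_deg]


table = make_embedding_table(360 // GRID_STEP_DEG, EMBED_DIM)

# The target DOA switches abruptly from 30 to 120 degrees at frame 3.
doa_per_frame = [30, 30, 30, 120, 120]
embeddings = frame_wise_embeddings(table, doa_per_frame)
```

Because the lookup is repeated for every frame, the conditioning vector changes on the very frame where the DOA changes; how quickly the extracted speech follows depends on the recurrent model itself.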
