All Neural Low-latency Directional Speech Extraction

Ashutosh Pandey; Sanha Lee; Juan Azcarreta; Daniel Wong; Buye Xu

TL;DR

This paper tackles low-latency directional speech extraction with a fully neural framework: trainable DOA embeddings drawn from a predefined spatial grid condition an all-neural, time-domain speech extractor called the Directional Recurrent Network (DRN). The method integrates two forms of DOA embeddings, channel-wise and frame-wise, into the spatial and temporal processing via a fusion scheme, and is optimized end to end with a speech enhancement objective at millisecond-scale latency. Key findings show that DOA embeddings, especially azimuth-elevation embeddings, improve performance and robustness to DOA mismatch, and that the model adapts rapidly to abrupt DOA switches in dynamic scenes; the approach also outperforms several strong baselines under low-latency constraints. The work advances practical directional speech extraction for multichannel setups in scenarios such as augmented reality and wearable devices, and points to future work on moving-source training and more dynamic motion handling.

Abstract

We introduce a novel all-neural model for low-latency directional speech extraction. The model uses direction of arrival (DOA) embeddings from a predefined spatial grid, which are transformed and fused into a recurrent neural network based speech extraction model. This process enables the model to effectively extract speech from a specified DOA. Unlike previous methods that relied on hand-crafted directional features, the proposed model trains DOA embeddings from scratch using a speech enhancement loss, making it suitable for low-latency scenarios. Additionally, it operates at a high frame rate and takes in a DOA with each input frame, which lets it adapt quickly to changing scenes in highly dynamic real-world scenarios. We provide an extensive evaluation to demonstrate the model's efficacy in directional speech extraction, its robustness to DOA mismatch, and its capability to quickly adapt to abrupt changes in DOA.
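The idea of looking up a DOA embedding from a predefined grid at every input frame can be illustrated with a minimal sketch. This is not the authors' implementation: the 5-degree azimuth-only grid, the embedding size, and the function names are all assumptions made for illustration, random vectors stand in for embeddings that would be learned with the speech enhancement loss, and the transformation and fusion into the DRN are omitted.

```python
import random

GRID_STEP_DEG = 5  # assumed grid resolution (hypothetical, not from the paper)
EMBED_DIM = 8      # assumed embedding size (hypothetical, not from the paper)


def build_embedding_table(step_deg=GRID_STEP_DEG, dim=EMBED_DIM, seed=0):
    """Create one vector per azimuth grid point.

    Random values are placeholders for embeddings that would be trained.
    """
    rng = random.Random(seed)
    n_points = 360 // step_deg
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]


def azimuth_to_index(azimuth_deg, step_deg=GRID_STEP_DEG):
    """Snap an azimuth in degrees to the nearest grid index, wrapping at 360."""
    n_points = 360 // step_deg
    return round((azimuth_deg % 360) / step_deg) % n_points


def frame_wise_embeddings(doa_per_frame, table, step_deg=GRID_STEP_DEG):
    """Return one embedding per frame, so the target DOA may change at any frame."""
    return [table[azimuth_to_index(a, step_deg)] for a in doa_per_frame]


table = build_embedding_table()
# The target direction switches abruptly from 30 to 120 degrees at frame 3.
doa_track = [30, 30, 30, 120, 120, 120]
embeddings = frame_wise_embeddings(doa_track, table)
```

Because the lookup happens per frame, a change in the requested DOA reaches the extractor on the very next frame, which is the property the abstract attributes to the high frame rate.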

Paper Structure (16 sections, 2 equations, 4 figures, 3 tables)

Introduction
Proposed method
Problem formulation
Model architecture
Channel-wise DOA embeddings
Frame-wise DOA embeddings
Fusing DOA embeddings
Dataset Generation
Results
Experiments design
Finding the optimal location embeddings
Higher latency and frequency domain processing
Switching the direction of the target speech
Baseline models
Robustness to DOA Mismatch
...and 1 more sections

Figures (4)

Figure 1: Schematic diagram of DRN.
Figure 2: Comparing channel-wise and frame-wise embeddings. In the plot labels, A and AE represent azimuth-only and azimuth-elevation embeddings, respectively, and the number represents the hidden size $H$.
Figure 3: Spectrograms of a) the noisy audio, b) the DRN-enhanced audio, and c) the target audio. The input DOA is switched between talkers, and the output switches to the corresponding talker.
Figure 4: Sensitivity of DRN to DOA mismatch.
