Neural Directed Speech Enhancement with Dual Microphone Array in High Noise Scenario

Wen Wen, Qiang Zhou, Yu Xi, Haoyu Li, Ziqi Gong, Kai Yu

TL;DR

A causal-directed U-Net (CDUNet) model is introduced, which takes raw multi-channel speech and the desired enhancement width as inputs and enables dynamic adjustment of steering vectors based on the target direction and fine-tuning of the enhancement region according to the angular separation between the target and interference signals.

Abstract

In multi-speaker scenarios, leveraging spatial features is essential for enhancing target speech. However, with a limited number of microphones, developing a compact multi-channel speech enhancement system remains challenging, especially in extremely low signal-to-noise ratio (SNR) conditions. To tackle this issue, we propose a triple-steering spatial selection method, a flexible framework that uses three steering vectors to guide enhancement and determine the enhancement range. Specifically, we introduce a causal-directed U-Net (CDUNet) model, which takes raw multi-channel speech and the desired enhancement width as inputs. This enables dynamic adjustment of the steering vectors based on the target direction and fine-tuning of the enhancement region according to the angular separation between the target and interference signals. Our model, using only a dual-microphone array, excels in both speech quality and downstream task performance. It operates in real time with minimal parameters, making it well suited to low-latency, on-device streaming applications.
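
To make the triple-steering idea concrete, the following is a minimal sketch of how three steering vectors could be built for a two-microphone array from a target direction and an enhancement width. It is not the authors' implementation. The far-field plane-wave model, the 5 cm microphone spacing, the speed of sound, the placement of the two outer vectors at the target angle plus and minus half the width, and the function names `steering_vector` and `triple_steering` are all assumptions made for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s (assumed value)


def steering_vector(angle_deg, freqs_hz, mic_spacing_m=0.05):
    """Far-field steering vector for a 2-mic linear array (hypothetical helper).

    The angle is measured from the array axis, so 90 degrees is broadside.
    Returns a complex array of shape (num_freqs, 2).
    """
    # Microphone coordinates along the array axis, centred on the origin.
    mic_positions = np.array([-mic_spacing_m / 2, mic_spacing_m / 2])
    # Arrival delay relative to the array centre; the microphone nearer
    # to the source gets a negative delay.
    delays = -mic_positions * np.cos(np.deg2rad(angle_deg)) / SPEED_OF_SOUND
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    return np.exp(-2j * np.pi * freqs_hz[:, None] * delays[None, :])


def triple_steering(target_deg, width_deg, freqs_hz, mic_spacing_m=0.05):
    """Stack steering vectors for the left edge, centre, and right edge.

    Assumption: the enhancement region is [target - width/2, target + width/2].
    Returns a complex array of shape (3, num_freqs, 2).
    """
    angles = (target_deg - width_deg / 2, target_deg, target_deg + width_deg / 2)
    return np.stack([steering_vector(a, freqs_hz, mic_spacing_m) for a in angles])


if __name__ == "__main__":
    freqs = np.linspace(0.0, 8000.0, 257)  # e.g. bins of a 512-point FFT at 16 kHz
    vectors = triple_steering(target_deg=90.0, width_deg=30.0, freqs_hz=freqs)
    print(vectors.shape)
```

Under these assumptions, narrowing `width_deg` pulls the two outer vectors toward the centre one, which is one way to picture tightening the enhancement region when the interferer is close to the target in angle.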

Paper Structure

This paper contains 16 sections, 5 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Illustration of the CDUNet architecture. The beamformer output incorporates both the target direction and the width input, which captures the spatial area information crucial for enhancement. $\varphi_{width}$ denotes the extent of the target region to be enhanced, and $\varphi_{target}$ specifies the orientation of the target speaker. The "Near Mic. Selection" operation selects the speech signal from the microphone that is positioned closer to the target speaker.
  • Figure 2: Illustration of the simulation setup of the first fixed-target dataset. The target direction ranges from 85° to 95°, represented by red stars in the figure, while the interference direction is located 15° away from the target direction, indicated by green stars. Room information is uniformly sampled from the provided ranges in the table.
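
The "Near Mic. Selection" step in Figure 1 can be pictured with a short sketch. This is an illustration under stated assumptions, not the paper's code: it assumes a far-field source, a two-microphone linear array centred on the origin, an angle measured from the array axis, and a hypothetical function name `select_near_mic`.

```python
import numpy as np


def select_near_mic(signals, target_deg, mic_positions_m=(-0.025, 0.025)):
    """Pick the channel of the microphone closer to a far-field target.

    signals: array of shape (2, num_samples), one row per microphone.
    target_deg: target direction measured from the array axis (assumed).
    Returns (mic_index, selected_signal).
    """
    # Component of the unit vector pointing at the source along the array axis.
    direction = np.cos(np.deg2rad(target_deg))
    # Projection of each microphone position onto that direction;
    # the larger projection belongs to the microphone nearer the source.
    projections = np.asarray(mic_positions_m) * direction
    mic_index = int(np.argmax(projections))
    return mic_index, signals[mic_index]


if __name__ == "__main__":
    two_channel = np.arange(8, dtype=float).reshape(2, 4)
    index, channel = select_near_mic(two_channel, target_deg=30.0)
    print(index, channel)
```

Near broadside (90°) both microphones are almost equidistant from the source, so in this sketch the choice there is effectively arbitrary.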