Robust Target Speaker Direction of Arrival Estimation

Zixuan Li, Shulin He, Xueliang Zhang

TL;DR

The paper tackles robust target speaker DOA estimation in multi-speaker, noisy, and reverberant environments. It introduces RTS-DOA, a deep neural network that integrates a speech enhancement module, a spatial module with full-band and sub-band processing, and a speaker feature module that leverages a target voiceprint. Through joint optimization, RTS-DOA achieves substantial improvements in DOA accuracy (AR) while remaining compact (0.12M parameters in the standard version), and it performs strongly on LibriSpeech under realistic room and noise conditions. The approach highlights the practical value of combining speech quality improvement, advanced spatial representations, and anchor speaker information for reliable target DOA estimation in multi-speaker scenarios.

Abstract

In multi-speaker environments, the direction of arrival (DOA) of a target speaker is key for improving speech clarity and extracting the target speaker's voice. However, traditional DOA estimation methods often struggle in the presence of noise and reverberation, and particularly when competing speakers are present. To address these challenges, we propose RTS-DOA, a robust real-time DOA estimation system. The system uses the registered speech of the target speaker as a reference and leverages full-band and sub-band spectral information from a microphone array to estimate the DOA of the target speaker's voice. Specifically, it comprises a speech enhancement module that first improves speech quality, a spatial module that learns spatial information, and a speaker module that extracts voiceprint features. Experimental results on the LibriSpeech dataset demonstrate that RTS-DOA effectively handles multi-speaker scenarios and achieves new best results.

Paper Structure

This paper contains 13 sections, 3 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Structure of the six-channel microphone array. For brevity, the figure presents the source signal distribution from 0° to 180°. The unseen distribution from 180° to 360° is a mirror reflection of the shown pattern, ensuring symmetry.
  • Figure 2: (a) The overall structure of the RTS-DOA system. (b) The structure of ConvGLU, where $\sigma$ represents the Sigmoid activation function and $\odot$ denotes element-wise multiplication. (c) The detailed structure of the Spatial Layer.
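
The gating described in the Figure 2(b) caption, a linear branch multiplied element-wise by a Sigmoid-activated branch, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the 1-D causal convolution, the tensor shapes, and the names `conv1d` and `conv_glu` are assumptions made for illustration, and the paper's ConvGLU may use different kernel shapes, normalization, or padding.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function (the sigma in the Figure 2(b) caption)."""
    return 1.0 / (1.0 + np.exp(-x))


def conv1d(x, w, b):
    """Causal 1-D convolution (an assumption, chosen to suit real-time use).

    x: (C_in, T) input, w: (C_out, C_in, K) kernel, b: (C_out,) bias.
    Returns an array of shape (C_out, T).
    """
    c_out, c_in, k = w.shape
    # Left-pad with K-1 zeros so frame t only depends on frames <= t.
    xp = np.pad(x, ((0, 0), (k - 1, 0)))
    t_len = x.shape[1]
    out = np.empty((c_out, t_len))
    for t in range(t_len):
        window = xp[:, t:t + k]  # (C_in, K) slice ending at frame t
        out[:, t] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return out


def conv_glu(x, w_lin, b_lin, w_gate, b_gate):
    """Gated linear unit: linear branch * sigmoid(gate branch), element-wise."""
    linear = conv1d(x, w_lin, b_lin)
    gate = sigmoid(conv1d(x, w_gate, b_gate))
    return linear * gate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, 10))  # hypothetical: 3 channels, 10 frames
    w_lin = rng.standard_normal((4, 3, 2))
    w_gate = rng.standard_normal((4, 3, 2))
    y = conv_glu(x, w_lin, np.zeros(4), w_gate, np.zeros(4))
    print(y.shape)
```

Because the gate lies in (0, 1), each output element is a scaled copy of the linear branch, which lets the layer suppress or pass features per channel and per frame.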