Deep Learning Based Stage-wise Two-dimensional Speaker Localization with Large Ad-hoc Microphone Arrays

Shupei Liu; Linfeng Feng; Yijun Gong; Chengdong Liang; Chen Zhang; Xiao-Lei Zhang; Xuelong Li

TL;DR

This work addresses 2D speaker localization using large-scale ad-hoc microphone arrays by proposing a stage-wise framework that first estimates DOA at each node with CNN backbones, then selects reliable nodes, triangulates rough positions from bearing lines, and finalizes 2D locations via mean-shift clustering. It introduces two CNN backbones (CNN-MLC and CNN-Mask), employs unbiased label distribution to avoid quantization errors, and uses weighted adjacent decoding for continuous angle estimation. A novel Libri-adhoc-nodes10 real-world dataset is introduced, and extensive experiments on simulated and real data show that the proposed CNN-ULD-based approach outperforms conventional DOA and 2D localization methods, with robust performance in single-source and real-world settings, while ghost-speaker effects are analyzed and mitigated through clustering and node selection. The framework demonstrates strong generalization from simulation to real environments and offers a flexible, scalable solution for practical deployment of ad-hoc microphone networks in speaker localization tasks.
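The summary above does not spell out the decoding rule behind "weighted adjacent decoding", so the following is only one plausible reading, not the authors' implementation: take the most probable angle class and refine it with a probability-weighted average over that class and its immediate neighbors. The function name `weighted_adjacent_decode`, the 10-degree grid, and the non-circular 0 to 180 degree range are assumptions made for illustration.

```python
import numpy as np

def weighted_adjacent_decode(probs, grid_deg):
    """Hypothetical decoder: probability-weighted mean of the peak class and its neighbors.

    probs    : per-class scores from a DOA classifier (assumed non-negative).
    grid_deg : angle in degrees represented by each class (assumed non-circular).
    """
    probs = np.asarray(probs, dtype=float)
    grid_deg = np.asarray(grid_deg, dtype=float)
    peak = int(np.argmax(probs))
    # Clip the window at the ends of the grid instead of wrapping around.
    lo = max(peak - 1, 0)
    hi = min(peak + 1, len(probs) - 1)
    weights = probs[lo:hi + 1]
    angles = grid_deg[lo:hi + 1]
    return float(np.sum(weights * angles) / np.sum(weights))

# Illustrative input: probability mass split between the 30 and 40 degree classes.
grid = np.arange(0.0, 181.0, 10.0)
probs = np.zeros(len(grid))
probs[3] = 0.5
probs[4] = 0.5
angle = weighted_adjacent_decode(probs, grid)
```

Because the estimate is a weighted mean rather than the argmax, it can fall between grid points, which is the sense in which such a decoder yields a continuous angle. A circular grid (0 to 360 degrees) would need wrap-around handling that this sketch omits.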

Abstract

While deep-learning-based speaker localization has shown advantages in challenging acoustic environments, it often yields only direction-of-arrival (DOA) cues rather than precise two-dimensional (2D) coordinates. To address this, we propose a novel deep-learning-based 2D speaker localization method that leverages ad-hoc microphone arrays, where an ad-hoc microphone array is composed of randomly distributed microphone nodes, each of which is equipped with a traditional array. Specifically, we first employ convolutional neural networks at each node to estimate speaker directions. Then, we integrate these DOA estimates using triangulation and clustering techniques to obtain 2D speaker locations. To further boost the estimation accuracy, we introduce a node selection algorithm that retains only the most reliable nodes. Extensive experiments on both simulated and real-world data demonstrate that our approach significantly outperforms conventional methods, and that the proposed node selection further refines performance. The real-world dataset used in the experiments, named Libri-adhoc-nodes10, is newly recorded and described for the first time in this paper; it is available online at https://github.com/Liu-sp/Libri-adhoc-nodes10.
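The triangulation and clustering stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the node layout, the 25-degree error injected at one node, the flat-kernel mean shift, the 0.5 bandwidth, and the helper names `intersect_bearings` and `mean_shift_mode` are all assumptions chosen for illustration. Each pair of nodes contributes the intersection of its two bearing lines, and the densest cluster of intersections is taken as the speaker position.

```python
import numpy as np
from itertools import combinations

def intersect_bearings(p1, theta1, p2, theta2, eps=1e-9):
    """Intersect rays p_i + t_i * (cos theta_i, sin theta_i) with t_i >= 0.

    Returns None when the rays are parallel or meet behind either node.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    d1 = np.array([np.cos(theta1), np.sin(theta1)])
    d2 = np.array([np.cos(theta2), np.sin(theta2)])
    A = np.column_stack((d1, -d2))  # solves t1*d1 - t2*d2 = p2 - p1
    if abs(np.linalg.det(A)) < eps:
        return None
    t = np.linalg.solve(A, p2 - p1)
    if t[0] < 0 or t[1] < 0:
        return None
    return p1 + t[0] * d1

def mean_shift_mode(points, bandwidth=0.5, iters=50, tol=1e-6):
    """Flat-kernel mean shift from every point; returns the best-supported mode."""
    points = np.asarray(points, dtype=float)
    modes = points.copy()
    for _ in range(iters):
        new = modes.copy()
        for i, m in enumerate(modes):
            near = points[np.linalg.norm(points - m, axis=1) <= bandwidth]
            if len(near) > 0:
                new[i] = near.mean(axis=0)
        shift = np.max(np.linalg.norm(new - modes, axis=1))
        modes = new
        if shift < tol:
            break
    support = [np.sum(np.linalg.norm(points - m, axis=1) <= bandwidth) for m in modes]
    return modes[int(np.argmax(support))]

# Assumed toy scene: four nodes in a 4 x 3 room, one speaker, one unreliable node.
nodes = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0], [4.0, 3.0]])
speaker = np.array([1.5, 2.0])
doas = np.arctan2(speaker[1] - nodes[:, 1], speaker[0] - nodes[:, 0])
doas[3] += np.deg2rad(25.0)  # simulate a poor DOA estimate at the last node

points = []
for i, j in combinations(range(len(nodes)), 2):
    p = intersect_bearings(nodes[i], doas[i], nodes[j], doas[j])
    if p is not None:
        points.append(p)

estimate = mean_shift_mode(points, bandwidth=0.5)
```

In this toy scene the intersections from the three accurate nodes coincide at the speaker, while those involving the erroneous node land elsewhere, so choosing the best-supported cluster suppresses the outliers. A node selection step, as proposed in the paper, would aim to remove such unreliable nodes before triangulation.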

Paper Structure (34 sections, 16 equations, 5 figures, 11 tables)

Introduction
Motivation and challenges
Framework of the proposed method
Goals and contributions
CNN-based DOA estimation at each single ad-hoc node
Backbone networks
CNN-MLC
CNN-Mask
Permutation ambiguity
Permutation invariant training
Location-based training
Label encoding
Weighted adjacent decoding
Node interaction
Node selection
...and 19 more sections

Figures (5)

Figure 1: Diagram of the proposed 2-dimensional speaker localization method based on deep learning.
Figure 2: Eqs. \ref{eq:alpha_nonlinear} and \ref{eq:alpha_linear} can be interpreted physically as follows: for nonlinear arrays, only one bearing line $\hat{\alpha}_{(n,b,1)}$ is emitted, whereas for linear arrays, two bearing lines $\hat{\alpha}_{(n,b,1)}$ and $\hat{\alpha}_{(n,b,2)}$ are emitted.
Figure 3: The "ghost" speaker problem caused by linear arrays.
Figure 4: Recording environment and settings of the two rooms of Libri-adhoc-nodes10. The blue dots represent the positions of the ad-hoc nodes. The loudspeaker icons represent the positions and orientations of the speakers. Each configuration has two sub-configurations, which differ in the self-rotation angles of the sub-arrays.
Figure 5: Visualizations of the intersections produced by the pairs of ad-hoc nodes with respect to the MAE levels. The red and yellow points are the selected points by mean-shift clustering for estimating the final 2D speaker locations.
