Array Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers
Marvin Tammen, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, Simon Doclo
TL;DR
This work tackles robust mask-based beamforming for moving speakers across arbitrary microphone arrays. It extends an attention-based spatial covariance matrix (SCM) aggregation framework with three robustness techniques: training with random channel configurations, permutation-invariant transform-average-concatenate (TAC) processing of multi-channel features, and robust mag-IPD input features. Experimental results on CHiME-3 and DEMAND show that combining these approaches consistently improves speech enhancement for moving speakers and unseen array geometries, outperforming baselines even under mismatches in channel permutation, channel count, and array geometry. As a result, a single trained model can be deployed across different microphone arrays instead of requiring one model per array.
Abstract
Although mask-based beamforming is a powerful speech enhancement approach, it often requires manual parameter tuning to handle moving speakers. Recently, this approach was augmented with an attention-based spatial covariance matrix aggregator (ASA) module, enabling accurate tracking of moving speakers without manual tuning. However, the deep neural network model used in this module is limited to specific microphone arrays, necessitating a different model for varying channel permutations, numbers, or geometries. To improve the robustness of the ASA module against such variations, in this paper we investigate three approaches: training with random channel configurations, employing the transform-average-concatenate method to process multi-channel input features, and utilizing robust input features. Our experiments on the CHiME-3 and DEMAND datasets show that these approaches enable the ASA-augmented beamformer to track moving speakers across different microphone arrays unseen in training.
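To illustrate why transform-average-concatenate processing is insensitive to channel order and channel count, the following is a minimal NumPy sketch of a single TAC-style step. It is not the authors' implementation: the function name `tac`, the layer sizes, and the random weights standing in for trained layers are all assumptions made for illustration, and the sketch omits details such as residual connections and normalization.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def tac(features, w_transform, w_average, w_concat):
    """Simplified transform-average-concatenate step (illustrative only).

    features: array of shape (channels, frames, dim).
    The same weights are shared by every channel, so the output does not
    depend on the channel order or on the number of channels.
    """
    # Transform: apply the same linear layer to each channel independently.
    transformed = relu(features @ w_transform)              # (C, T, H)
    # Average: pool across channels, then process the pooled features.
    averaged = relu(transformed.mean(axis=0) @ w_average)   # (T, H)
    # Concatenate: give every channel the pooled cross-channel features.
    averaged = np.broadcast_to(averaged, transformed.shape)  # (C, T, H)
    concatenated = np.concatenate([transformed, averaged], axis=-1)
    return relu(concatenated @ w_concat)                    # (C, T, dim)


# Hypothetical sizes and random weights in place of trained parameters.
rng = np.random.default_rng(0)
dim, hidden = 8, 16
w_transform = rng.standard_normal((dim, hidden))
w_average = rng.standard_normal((hidden, hidden))
w_concat = rng.standard_normal((2 * hidden, dim))

x = rng.standard_normal((6, 10, dim))  # 6 channels, 10 frames
y = tac(x, w_transform, w_average, w_concat)

# Permuting the input channels permutes the output in the same way.
perm = rng.permutation(6)
y_perm = tac(x[perm], w_transform, w_average, w_concat)
permutation_ok = np.allclose(y_perm, y[perm])

# The same weights also accept a different number of channels.
y_sub = tac(x[:4], w_transform, w_average, w_concat)
```

Because the only cross-channel operation is a mean, reordering the microphones merely reorders the per-channel outputs, and any later channel-wise pooling of those outputs is unaffected by the ordering.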
