Array Geometry-Robust Attention-Based Neural Beamformer for Moving Speakers

Marvin Tammen, Tsubasa Ochiai, Marc Delcroix, Tomohiro Nakatani, Shoko Araki, Simon Doclo

TL;DR

This work tackles the challenge of robust mask-based beamforming for moving speakers across arbitrary microphone arrays. It extends an attention-based SCM aggregation framework with three robustness techniques: training with random channel configurations, a permutation-invariant TAC processing of multi-channel features, and robust mag-IPD input features. Experimental results on CHiME-3 and DEMAND demonstrate that combining these approaches yields consistent improvements in speech enhancement for moving speakers and unseen array geometries, outperforming baselines even under channel permutation, count, and geometry mismatches. The methods enhance deployment flexibility for practical multi-channel speech systems in dynamic environments.
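
As a rough illustration of two of these techniques, the sketch below draws a random channel configuration (a random channel count in a random order) and computes magnitude-plus-IPD features for a single time-frequency bin. The function names `sample_channel_config` and `mag_ipd`, the cosine/sine encoding of the inter-channel phase difference, and the toy 4-microphone bin are all assumptions made for this example; the paper's exact feature definition is not given in this summary.

```python
import cmath
import math
import random


def sample_channel_config(num_available, min_channels=2, rng=random):
    """Pick a random number of channels and a random channel order (hypothetical helper)."""
    num_used = rng.randint(min_channels, num_available)
    # sample() draws without replacement, so the result is a shuffled subset
    return rng.sample(range(num_available), num_used)


def mag_ipd(bin_per_channel, ref=0):
    """Reference-channel magnitude followed by (cos, sin) of each inter-channel phase difference.

    This encoding is an assumption for illustration, not the paper's definition.
    """
    ref_value = bin_per_channel[ref]
    features = [abs(ref_value)]
    for ch, value in enumerate(bin_per_channel):
        if ch == ref:
            continue
        ipd = cmath.phase(value) - cmath.phase(ref_value)
        # cos/sin avoids the 2*pi wrap-around of the raw phase difference
        features.extend([math.cos(ipd), math.sin(ipd)])
    return features


# One time-frequency bin observed by a hypothetical 4-microphone array
stft_bin = [1 + 0j, 0 + 2j, -1 + 0j, 0.5 - 0.5j]
config = sample_channel_config(len(stft_bin), rng=random.Random(0))
selected = [stft_bin[ch] for ch in config]  # first selected channel acts as reference
print(config, mag_ipd(selected))
```

Because the feature length grows with the number of selected channels, a model consuming such features still needs a channel-count-agnostic mechanism, which is the role the TL;DR assigns to TAC processing.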

Abstract

Although mask-based beamforming is a powerful speech enhancement approach, it often requires manual parameter tuning to handle moving speakers. Recently, this approach was augmented with an attention-based spatial covariance matrix aggregator (ASA) module, enabling accurate tracking of moving speakers without manual tuning. However, the deep neural network model used in this module is limited to specific microphone arrays, necessitating a different model for varying channel permutations, numbers, or geometries. To improve the robustness of the ASA module against such variations, in this paper we investigate three approaches: training with random channel configurations, employing the transform-average-concatenate method to process multi-channel input features, and utilizing robust input features. Our experiments on the CHiME-3 and DEMAND datasets show that these approaches enable the ASA-augmented beamformer to track moving speakers across different microphone arrays unseen in training.
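
The transform-average-concatenate idea mentioned above can be sketched in a few lines. This is a simplified toy version, assuming small ReLU layers with random weights and no residual connection; the helper names (`make_layer`, `relu_linear`, `tac`) and all dimensions are hypothetical and do not reflect the network used in the paper. It shows why the method is insensitive to channel order and count: every channel passes through the same weights, and information is shared only through an average over channels.

```python
import random


def make_layer(in_dim, out_dim, rng):
    """Random weights and bias for a toy fully connected layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [rng.uniform(-1, 1) for _ in range(out_dim)]
    return weights, bias


def relu_linear(x, layer):
    """Apply a linear layer followed by ReLU to one feature vector."""
    weights, bias = layer
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def tac(channels, transform, average, concat):
    """Transform each channel, average across channels, concatenate, transform again."""
    hidden = [relu_linear(x, transform) for x in channels]      # shared weights
    mean = [sum(col) / len(hidden) for col in zip(*hidden)]     # order-independent
    pooled = relu_linear(mean, average)
    return [relu_linear(h + pooled, concat) for h in hidden]    # list concatenation


rng = random.Random(0)
transform = make_layer(3, 4, rng)
average = make_layer(4, 4, rng)
concat = make_layer(8, 3, rng)
channels = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]

outputs = tac(channels, transform, average, concat)
# Reversing the channel order only reverses the per-channel outputs
reversed_outputs = tac(channels[::-1], transform, average, concat)
print(len(outputs), len(reversed_outputs))
```

The same three layers also accept any other number of channels, for example `tac(channels[:2], transform, average, concat)`, without changing the parameters.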

Paper Structure (14 sections, 6 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overview of the mask-based MVDR beamformer with the ASA module. Grey vertically stacked boxes share weights.
  • Figure 2: Attention weight estimator employing different approaches to process multi-channel features. Grey vertically stacked boxes share weights.
  • Figure 3: Considered microphone array geometries. Grey circles denote the reference and white circles denote unused microphones.