Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm
Anselm Lohmann, Tomohiro Nakatani, Rintaro Ikeshita, Marc Delcroix, Shoko Araki, Simon Doclo
TL;DR
This work addresses how to select the reference microphone in guided source separation (GSS) for distant ASR, noting that conventional SNR-based selection may miss differences in the early-to-late reverberation ratio (ELR) across microphones. It introduces two strategies based on the normalized $\ell_p$-norm: one that uses only the normalized $\ell_p$-norm of the beamformer outputs, and one that blends this norm with the broadband SNR through a trade-off parameter $\alpha$. Evaluations with a CHiME-8 distant ASR system show that both methods outperform the SNR baseline, with the combined approach yielding the lowest macro-average tcpWER. The results suggest that sparsity-aware selection based on the beamformer outputs, optionally fused with SNR information, improves the robustness of GSS in spatially distributed microphone setups.
Abstract
Guided Source Separation (GSS) is a popular front-end for distant automatic speech recognition (ASR) systems using spatially distributed microphones. In such setups, the choice of reference microphone may have a large influence on the quality of the output signal and the downstream ASR performance. In GSS-based speech enhancement, reference microphone selection is typically performed using the signal-to-noise ratio (SNR), which is optimal for noise reduction but may neglect differences in the early-to-late reverberation ratio (ELR) across microphones. In this paper, we propose two reference microphone selection methods for GSS-based speech enhancement that are based on the normalized $\ell_p$-norm: one uses only the normalized $\ell_p$-norm, and the other combines the normalized $\ell_p$-norm with the SNR to account for differences in both SNR and ELR across microphones. Experimental evaluation using a CHiME-8 distant ASR system shows that the proposed $\ell_p$-norm-based methods outperform the baseline method, reducing the macro-average word error rate.
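The selection idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact definition of the normalized $\ell_p$-norm or of the combination rule. The sketch therefore assumes the common sparsity measure $\|X\|_p / \|X\|_2$ computed over all time-frequency bins of each candidate beamformer output (for $p < 2$, lower values indicate a sparser and presumably less reverberant signal), and a hypothetical convex combination of standardized scores weighted by $\alpha$. The function names, the standardization step, and the convention that $\alpha = 1$ means norm-only selection are all illustrative assumptions.

```python
import numpy as np


def normalized_lp_norm(stft, p=1.0, eps=1e-12):
    """Assumed sparsity measure: ||X||_p / ||X||_2 over all time-frequency bins.

    For p < 2, a lower value means a sparser signal.
    """
    mag = np.abs(np.asarray(stft)).ravel()
    lp = np.sum(mag ** p) ** (1.0 / p)
    l2 = np.sqrt(np.sum(mag ** 2))
    return lp / (l2 + eps)


def _standardize(values, eps=1e-12):
    """Zero-mean, unit-variance scaling so the two criteria are comparable."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / (values.std() + eps)


def select_reference(candidates, snr_db=None, p=1.0, alpha=1.0):
    """Return the index of the selected reference microphone.

    candidates: list of STFT arrays, one beamformer output per candidate
                reference microphone.
    snr_db:     optional broadband SNR estimate per candidate (in dB).
    alpha:      hypothetical trade-off; 1.0 uses only the normalized
                lp-norm, 0.0 uses only the SNR.
    """
    # Negate the norm so that a higher score is better for both criteria.
    sparsity_score = np.array([-normalized_lp_norm(x, p) for x in candidates])
    if snr_db is None:
        return int(np.argmax(sparsity_score))
    # Assumed combination rule: convex mix of standardized scores.
    score = alpha * _standardize(sparsity_score) + (1.0 - alpha) * _standardize(snr_db)
    return int(np.argmax(score))


# Toy example: a dense "reverberant" output and a sparse "dry" output.
dense = np.ones((10, 10))
sparse = np.zeros((10, 10))
sparse[0, 0] = 1.0

norm_only_choice = select_reference([dense, sparse])
snr_only_choice = select_reference([dense, sparse], snr_db=[10.0, 0.0], alpha=0.0)
```

In this toy case, norm-only selection picks the sparse output, while setting `alpha=0.0` falls back to pure SNR-based selection. Intermediate values of `alpha` trade the two criteria against each other; how the paper scales and combines them may differ from the standardization assumed here.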
