Reference Microphone Selection for Guided Source Separation based on the Normalized L-p Norm

Anselm Lohmann, Tomohiro Nakatani, Rintaro Ikeshita, Marc Delcroix, Shoko Araki, Simon Doclo

TL;DR

This work addresses how to select the reference microphone in guided source separation (GSS) for distant ASR, noting that conventional SNR-based selection may overlook differences in the early-to-late-reverberant ratio (ELR) across microphones. It introduces two strategies based on the normalized $\ell_p$-norm: one uses only the normalized $\ell_p$-norm of the beamformer outputs, and the other blends this norm with the broadband SNR through a trade-off parameter $\alpha$. Evaluations with a CHiME-8 distant ASR system show that both methods outperform the SNR baseline, with the combined approach yielding the lowest macro-average tcpWER. The results suggest that sparsity-aware selection on the beamformer outputs, optionally fused with SNR information, makes GSS more robust in spatially distributed microphone setups.

Abstract

Guided Source Separation (GSS) is a popular front-end for distant automatic speech recognition (ASR) systems using spatially distributed microphones. When considering spatially distributed microphones, the choice of reference microphone may have a large influence on the quality of the output signal and the downstream ASR performance. In GSS-based speech enhancement, reference microphone selection is typically performed using the signal-to-noise ratio (SNR), which is optimal for noise reduction but may neglect differences in early-to-late-reverberant ratio (ELR) across microphones. In this paper, we propose two reference microphone selection methods for GSS-based speech enhancement that are based on the normalized $\ell_p$-norm, either using only the normalized $\ell_p$-norm or combining the normalized $\ell_p$-norm and the SNR to account for both differences in SNR and ELR across microphones. Experimental evaluation using a CHiME-8 distant ASR system shows that the proposed $\ell_p$-norm-based methods outperform the baseline method, reducing the macro-average word error rate.
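To make the selection idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the normalized $\ell_p$-norm is the ratio $\|y\|_p / \|y\|_2$ of a candidate beamformer output $y$ (smaller meaning sparser for $p < 2$), and it assumes a simple convex blend with min-max scaled SNR values. The function names, the scaling step, and the convention that $\alpha = 1$ uses only the norm are all hypothetical choices for illustration; the paper's exact definitions may differ.

```python
import math


def normalized_lp_norm(x, p=1.0):
    """Return ||x||_p / ||x||_2 (assumed definition); smaller is sparser for p < 2."""
    mags = [abs(v) for v in x]  # abs() also handles complex STFT coefficients
    l2 = math.sqrt(sum(m * m for m in mags))
    if l2 == 0.0:
        raise ValueError("all-zero signal has no defined normalized norm")
    return sum(m ** p for m in mags) ** (1.0 / p) / l2


def _min_max(values):
    """Scale values to [0, 1] across candidates (hypothetical normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_reference(outputs, snrs_db=None, alpha=1.0, p=1.0):
    """Pick the index of the candidate output with the lowest blended cost.

    outputs: one sequence of coefficients per candidate reference microphone.
    snrs_db: optional broadband SNR estimate per candidate (higher is better).
    alpha:   assumed convention, 1.0 = norm only, 0.0 = SNR only.
    """
    norm_cost = _min_max([normalized_lp_norm(y, p) for y in outputs])
    if snrs_db is None:
        snr_gain = [0.0] * len(outputs)
    else:
        snr_gain = _min_max(snrs_db)
    # Low normalized norm and high SNR both reduce the cost.
    cost = [alpha * n - (1.0 - alpha) * s for n, s in zip(norm_cost, snr_gain)]
    return min(range(len(cost)), key=cost.__getitem__)


# Toy candidates: the second is the sparsest, the first has the highest SNR.
outputs = [[1, 1, 1, 1], [2, 0, 0, 0], [1, 1, 0, 0]]
snrs_db = [20.0, 5.0, 10.0]
norm_only_choice = select_reference(outputs, alpha=1.0)
snr_only_choice = select_reference(outputs, snrs_db, alpha=0.0)
```

In this toy setup the norm-only criterion picks the sparsest candidate while the SNR-only criterion picks the highest-SNR one, which illustrates why a trade-off parameter is needed when the two criteria disagree.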

Paper Structure

This paper contains 11 sections, 15 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: GSS-based speech enhancement
  • Figure 2: ASR performance of CHiME-8 system using both the normalized $\ell_p$-norm and SNR for reference microphone selection in terms of macro-average tcpWER (%) using oracle diarization labels and estimated diarization labels on CHiME-8 development data for different values of the trade-off parameter $\alpha$