Statistical Beamformer Exploiting Non-stationarity and Sparsity with Spatially Constrained ICA for Robust Speech Recognition

Ui-Hyeop Shin, Hyung-Min Park

TL;DR

The paper tackles robust automatic speech recognition (ASR) by combining a generalized statistical beamforming framework with a sparsity-aware complex Laplacian target model that has time-varying variances. It introduces Mask-S-MLDR, which uses both the target output and input masks to form weighted spatial covariance matrices (SCMs) for robust beamforming, and develops ICA-based steering vector estimation (SVE) with distortionless and null constraints, enhanced by a hybrid ICA-HC penalty approach. An online RLS-based algorithm updates the beamformer and the SVE jointly, frame by frame, supporting practical real-time ASR in non-stationary environments. Experimental results on CHiME-4 and LibriCSS show improved WER over conventional methods in both batch and online processing, including with dynamic target positions, highlighting the method’s robustness and generalizability.

Abstract

In this paper, we present a statistical beamforming algorithm as a pre-processing step for robust automatic speech recognition (ASR). By modeling the target speech with a non-stationary Laplacian distribution, we propose a mask-based statistical beamforming algorithm that exploits both the variance of its output and the variance of the masked input for robust estimation of the beamformer. We also present a method for steering vector estimation (SVE) based on a noise power ratio obtained from the target and noise outputs in independent component analysis (ICA). To update the beamformer in the same ICA framework, we derive ICA with distortionless and null constraints on the target speech, which yields beamformed speech at the target output and noise at the other outputs. The demixing weights for the target output result in a statistical beamformer with a weighted spatial covariance matrix (wSCM), whose weighting function is characterized by the source model. To enhance the SVE, the strict null constraints imposed by the Lagrange multiplier method are relaxed into generalized penalties with weight parameters, while the strict distortionless constraints are maintained. Furthermore, we derive an online algorithm based on the recursive least squares (RLS) optimization technique for practical applications. Experimental results in various environments using the CHiME-4 and LibriCSS datasets demonstrate the effectiveness of the presented algorithm compared to conventional beamforming and ICA-based blind source extraction (BSE) in both batch and online processing.
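To make the idea of a beamformer built from a weighted spatial covariance matrix under a distortionless constraint more concrete, here is a minimal NumPy sketch for a single frequency bin. It is an illustration under stated assumptions, not the paper's algorithm: the function name `weighted_scm_beamformer`, the synthetic data, the known steering vector `h`, and the weighting `1 / lam` (the Gaussian-model choice) are all assumptions introduced here. The weighting that follows from the paper's Laplacian source model, the mask-based variance, the ICA-based SVE, and the RLS online update are not reproduced.

```python
import numpy as np

def weighted_scm_beamformer(X, h, lam, eps=1e-6):
    """Distortionless beamformer from a weighted SCM (illustrative sketch).

    X   : (T, M) complex STFT frames of M microphones at one frequency bin
    h   : (M,)   steering vector (assumed known here)
    lam : (T,)   time-varying variance estimates of the target
    """
    T, M = X.shape
    # Assumed weighting function: frames with a small target variance get a large weight.
    weights = 1.0 / np.maximum(lam, eps)
    # Weighted SCM: V = (1/T) * sum_t weights[t] * x_t x_t^H
    V = (X.T * weights) @ X.conj() / T
    # Small diagonal loading for numerical stability.
    V = V + eps * np.trace(V).real / M * np.eye(M)
    # Minimize w^H V w subject to w^H h = 1  ->  w = V^{-1} h / (h^H V^{-1} h)
    Vinv_h = np.linalg.solve(V, h)
    return Vinv_h / (h.conj() @ Vinv_h)

# Synthetic example: one target with time-varying amplitude plus diffuse noise.
rng = np.random.default_rng(0)
T, M = 200, 4
h = np.exp(-2j * np.pi * rng.random(M))                 # hypothetical steering vector
envelope = rng.random(T) ** 2                           # non-stationary amplitude
s = envelope * (rng.standard_normal(T) + 1j * rng.standard_normal(T))
noise = 0.3 * (rng.standard_normal((T, M)) + 1j * rng.standard_normal((T, M)))
X = s[:, None] * h[None, :] + noise

# Alternate between the beamformer and the variance estimated from its own output.
lam = np.mean(np.abs(X) ** 2, axis=1)                   # initial variance from the input
for _ in range(3):
    w = weighted_scm_beamformer(X, h, lam)
    y = X @ w.conj()                                    # y_t = w^H x_t
    lam = np.abs(y) ** 2
```

The alternating loop mirrors the general idea of estimating the beamformer from the variance of its own output; whatever weighting is used, the closed-form solution keeps the response toward `h` at unity.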

Paper Structure

This paper contains 24 sections, 64 equations, 1 figure, 8 tables, 1 algorithm.

Figures (1)

  • Figure 1: Simulated room and microphone configuration. Within an utterance, the target speech source moved instantaneously between two blue points chosen at random from A, B, C, and D. A linear microphone array was simulated with a spacing of 5 cm between adjacent microphones, and the center microphone was fixed at (3.5 m, 2.5 m). There were 28 noise sources located at 1 m intervals along the sides of the room. All sources were at a height of 2 m.