PlumberNet: Fixing interference leakage after GEV beamforming

François Grondin, Caleb Rascón

TL;DR

Generalized Eigenvalue (GEV) beamforming provides a leakage estimate alongside the target speech estimate. Feeding both to the postfilter improves enhancement performance over a postfilter that uses the target speech estimate and a reference microphone signal.

Abstract

Spatial filters can exploit deep-learning-based speech enhancement models to increase their reliability in scenarios with multiple speech sources. To further improve speech quality, it is common to perform postfiltering on the estimated target speech obtained with spatial filtering. In this work, Generalized Eigenvalue (GEV) beamforming is employed to provide a leakage estimate, along with the target speech estimate, both of which are later used for postfiltering. This improves the enhancement performance over a postfilter that uses the target speech and a reference microphone signal. This work also demonstrates that the spatial covariance matrices (SCMs) can be accurately estimated from the direction of arrival (DoA) of the target and a discriminative selection amongst the pairwise estimated time-frequency masks.
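The mask-based pipeline summarized in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`masked_scm`, `gev_weights`, `beamform`), the array layout, and the diagonal loading constant are assumptions made for illustration. In particular, the sketch assumes the leakage beamformer is obtained by swapping the roles of the two SCMs, which the text above does not state explicitly.

```python
import numpy as np


def masked_scm(X, mask, eps=1e-6):
    """Mask-weighted spatial covariance matrix for each frequency bin.

    X    : complex STFT, shape (F, T, M) = (bins, frames, microphones)
    mask : real time-frequency mask in [0, 1], shape (F, T)
    Returns an array of shape (F, M, M).
    """
    num = np.einsum("ft,ftm,ftn->fmn", mask, X, X.conj())
    scm = num / np.maximum(mask.sum(axis=1), eps)[:, None, None]
    # Small diagonal loading (an assumption) keeps the matrices invertible.
    return scm + eps * np.eye(X.shape[-1])


def gev_weights(phi_num, phi_den):
    """Principal generalized eigenvector of (phi_num, phi_den) per bin.

    Maximizes (w^H phi_num w) / (w^H phi_den w). Inputs and output are
    stacked over frequency: (F, M, M) -> (F, M).
    """
    L = np.linalg.cholesky(phi_den)              # phi_den = L L^H
    L_inv = np.linalg.inv(L)
    L_inv_h = L_inv.conj().swapaxes(-1, -2)
    A = L_inv @ phi_num @ L_inv_h                # whitened problem
    A = 0.5 * (A + A.conj().swapaxes(-1, -2))    # enforce Hermitian symmetry
    _, vecs = np.linalg.eigh(A)                  # eigenvalues in ascending order
    v = vecs[..., -1]                            # eigenvector of the largest one
    return np.einsum("fmn,fn->fm", L_inv_h, v)   # undo the whitening


def beamform(w, X):
    """Apply per-bin weights: y(f, t) = w(f)^H x(f, t)."""
    return np.einsum("fm,ftm->ft", w.conj(), X)
```

Under these assumptions, given an STFT `X` and a target mask `mask`, the target SCM would be `masked_scm(X, mask)` and the interference SCM `masked_scm(X, 1 - mask)`. The target estimate is then `beamform(gev_weights(phi_target, phi_interf), X)`, and the leakage estimate uses `gev_weights(phi_interf, phi_target)` instead.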

Paper Structure (4 sections, 14 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Proposed method. SteerNet with discriminative selection generates a mask to isolate the target based on its DoA. The generated mask and its complement are used to estimate the target and interference covariance matrices. A GEV beamformer generates the estimated target signal, and a second one produces the leakage signal. These two signals are fed to PlumberNet, which estimates a fine-grained mask to further improve the target speech quality.
  • Figure 2: Mask estimation using the most discriminative pair of microphones. SteerNet estimates a mask for each pair, and the mask with the fewest time-frequency bins close to a value of $1$ is chosen. In this example, the pair of microphones $2$ and $3$ is ambiguous because the target and interference have the same time difference of arrival, so the resulting mask has more time-frequency bins with a value close to $1$ than the masks for pairs $(1,2)$ and $(1,3)$.
  • Figure 3: Microphone array geometries used for testing [grondin2020gev].
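The selection rule described in the Figure 2 caption can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_most_discriminative`, the dictionary layout, and the `threshold` of 0.9 used to decide that a bin is "close to 1" are hypothetical and do not come from the paper.

```python
import numpy as np


def select_most_discriminative(masks, threshold=0.9):
    """Pick the pairwise mask with the fewest bins close to 1.

    masks     : dict mapping a microphone pair (i, j) to a mask of
                shape (F, T) with values in [0, 1]
    threshold : bins at or above this value count as "close to 1"
                (an assumed value, not taken from the paper)
    Returns (best_pair, best_mask, counts).
    """
    counts = {
        pair: int(np.count_nonzero(mask >= threshold))
        for pair, mask in masks.items()
    }
    # An ambiguous pair lets interference through, which inflates its count,
    # so the pair with the smallest count is taken as the most discriminative.
    best_pair = min(counts, key=counts.get)
    return best_pair, masks[best_pair], counts
```

With masks for pairs (1, 2), (1, 3), and (2, 3) as in the Figure 2 example, the ambiguous pair (2, 3) would have the largest count and would not be selected. Ties are resolved in favor of the first pair in the dictionary, which is a choice of this sketch.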