Multi-channel Speech Separation Using Spatially Selective Deep Non-linear Filters

Kristina Tesch, Timo Gerkmann

TL;DR

This paper addresses multi-channel speech separation in reverberant environments by introducing a steerable deep non-linear spatial filter (SSF) that explicitly uses target DoA information to extract a chosen speaker. By comparing SSF with a strong direct-separation (DS) baseline using the same network architectures (JNF and McNet), the authors demonstrate that SSF yields substantial gains, especially as the number of concurrent speakers increases, due to better exploitation of spatial cues. The work also investigates robustness to DoA estimation errors and microphone-array perturbations, and generalization to near-field, similar-DoA scenarios, and unseen noise, showing that SSF generalizes better to unseen conditions than DS. Overall, the results support explicit spatial steering via SSF as a practical and effective approach for multi-channel speech separation with multiple active speakers. The methodology combines a non-linear joint spatial-temporal filtering framework with a DoA-conditioned steering mechanism, achieving improved separation quality (POLQA, SI-SDR, DNSMOS) and favorable perceptual outcomes in listening tests.

Abstract

In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture. In contrast to single-channel approaches, which rely on the different spectro-temporal characteristics of the speech signals, multi-channel approaches should additionally utilize the different spatial locations of the sources for a more powerful separation especially when the number of sources increases. To enhance the spatial processing in a multi-channel source separation scenario, in this work, we propose a deep neural network (DNN) based spatially selective filter (SSF) that can be spatially steered to extract the speaker of interest by initializing a recurrent neural network layer with the target direction. We compare the proposed SSF with a common end-to-end direct separation (DS) approach trained using utterance-wise permutation invariant training (PIT), which only implicitly learns to perform spatial filtering. We show that the SSF has a clear advantage over a DS approach with the same underlying network architecture when there are more than two speakers in the mixture, which can be attributed to a better use of the spatial information. Furthermore, we find that the SSF generalizes much better to additional noise sources that were not seen during training and to scenarios with speakers positioned at a similar angle.
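
The abstract states that the SSF is steered by initializing a recurrent layer with the target direction. The following is a minimal sketch of one way such a conditioning could look, not the authors' implementation. It assumes that the target DoA is one-hot encoded on a uniform angular grid and mapped by two linear projections to the initial hidden and cell state of an LSTM-style layer. The 2° grid resolution, the random weights, and the names `one_hot_doa` and `DoAConditioner` are all hypothetical choices made for illustration.

```python
import math
import random


def one_hot_doa(angle_deg, resolution_deg=2.0):
    """One-hot encode an azimuth angle on a uniform circular grid.

    The grid resolution is an assumption of this sketch.
    """
    n_bins = int(round(360.0 / resolution_deg))
    # Wrap the angle into [0, 360) and snap it to the nearest grid point.
    idx = int(round((angle_deg % 360.0) / resolution_deg)) % n_bins
    code = [0.0] * n_bins
    code[idx] = 1.0
    return code


def init_linear(n_out, n_in, rng):
    """Random weights and zero bias standing in for a trained linear layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(weights, bias, x):
    """Compute y = W x + b for plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


class DoAConditioner:
    """Maps a target DoA to an initial recurrent state (hypothetical helper)."""

    def __init__(self, hidden_size, resolution_deg=2.0, seed=0):
        rng = random.Random(seed)
        n_bins = int(round(360.0 / resolution_deg))
        self.resolution_deg = resolution_deg
        self.h_proj = init_linear(hidden_size, n_bins, rng)
        self.c_proj = init_linear(hidden_size, n_bins, rng)

    def initial_state(self, angle_deg):
        code = one_hot_doa(angle_deg, self.resolution_deg)
        h0 = linear(*self.h_proj, code)  # initial hidden state
        c0 = linear(*self.c_proj, code)  # initial cell state
        return h0, c0


conditioner = DoAConditioner(hidden_size=4)
h0, c0 = conditioner.initial_state(90.0)
```

In a trained system the returned `h0` and `c0` would replace the usual all-zero initial state of the recurrent layer, so that the same network weights extract a different speaker depending on the angle passed to `initial_state`.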

Paper Structure (18 sections, 4 equations, 8 figures, 5 tables)


Figures (8)

  • Figure 1: Schematic view of a spatially selective filter (SSF) based on the JNF (top) and McNet (bottom) network architecture. The proposed conditioning on the target DoA is depicted on the right side.
  • Figure 2: Illustration of the dataset generation. The target source is marked with a red cross and its DoA angle $\varphi_t$ is computed relative to the microphone orientation in the room given by $\varphi_m$. Interfering sources are placed in the gray area.
  • Figure 3: Results for a listening experiment assessing the participants' preference for separation results obtained with a spatial filter (SF) or a direct separation (DS) result. Speaker locations are assumed to be known for the spatial filter. The test is conducted blindly without test subjects knowing which example corresponds to which algorithm. The results have then been aggregated to match with the displayed statement.
  • Figure 4: Examples for blind speaker separation and localization by peak-searching for a mixture of two, three and five speakers using non-linear filters steered in all candidate directions. The vertical dashed gray lines indicate the true positions of the speakers and the green cross marks the speaker location estimated based on the energy peaks in the filter output.
  • Figure 5: Separation results for the McNet-SSF conditioned on a target angle that is subject to a localization error of varying magnitude. During evaluation, the respective error is added to all speakers' DoA angles. The results shown in the left plot are obtained with a McNet-SSF that has been conditioned on the exact DoA location during training, while the results displayed on the right side are obtained with a McNet-SSF that has been trained with inaccurate DoAs that include an error of up to $4^\circ$.
  • ...and 3 more figures
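
The peak-search localization described in the caption of Figure 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure used in the paper: it assumes that the filter has already been steered to every candidate direction on a uniform circular grid, that the output energy per direction is available as a plain list (the hypothetical `energies` argument), and that the number of speakers is known. The function names are likewise hypothetical.

```python
def circular_peaks(energies, num_speakers):
    """Return the indices of the strongest local maxima on a circular grid."""
    n = len(energies)
    peaks = []
    for i, e in enumerate(energies):
        # Neighbours wrap around, since the first and last direction are adjacent.
        left = energies[(i - 1) % n]
        right = energies[(i + 1) % n]
        # Strict on the left, non-strict on the right, so that a two-point
        # flat top is reported once (at its first index).
        if e > left and e >= right:
            peaks.append(i)
    # Keep the num_speakers peaks with the highest energy.
    peaks.sort(key=lambda i: energies[i], reverse=True)
    return sorted(peaks[:num_speakers])


def localize(energies, num_speakers, resolution_deg):
    """Convert the selected peak indices to DoA estimates in degrees."""
    return [i * resolution_deg for i in circular_peaks(energies, num_speakers)]


# Toy example: 8 candidate directions spaced 45 degrees apart.
energies = [0.1, 0.9, 0.2, 0.1, 0.6, 0.1, 0.05, 0.3]
estimates = localize(energies, num_speakers=2, resolution_deg=45.0)
# estimates == [45.0, 180.0]
```

Each estimated angle could then be used to steer the filter towards one speaker, which is what makes this scheme a blind separation method when the speaker positions are not given.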