Efficient Area-based and Speaker-Agnostic Source Separation

Martin Strauss, Okan Köpüklü

TL;DR

This work tackles speech separation based on a region of interest (ROI) in virtual meetings using a two-microphone array. It adopts a lightweight CRUSE-based network that processes multi-channel STFT inputs to produce a complex mask $Q$, from which the ROI speech is recovered via $\hat{\mathbf{t}} = \text{iSTFT}\{ Q \odot \mathbf{Y}_{\phi} \}$. The key contributions are (i) adapting CRUSE for efficient multi-channel ROI preservation, (ii) an evaluation against a Conv-TasNet baseline using DNSMOS and SI-SDR, and (iii) a power reduction (PR) heatmap that visualizes ROI coverage and suppression, showing that sources inside the ROI are retained and interference is suppressed in real-time scenarios. Results indicate that CRUSE$_{s,h}$ offers the best balance between perceptual quality and computational efficiency, while Conv-TasNet delivers higher SI-SDR at the cost of greater complexity. The findings support practical, real-time ROI-focused speech separation for meeting scenarios and privacy-aware spatial audio applications.
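The recovery step $\hat{\mathbf{t}} = \text{iSTFT}\{ Q \odot \mathbf{Y}_{\phi} \}$ can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that $\mathbf{Y}_{\phi}$ is the STFT of a single reference microphone, and it uses a toy STFT (rectangular window, no overlap, naive DFT) so that it runs with the Python standard library alone. All function names and the test signal are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive forward DFT of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, keeping the real part of each output sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def stft(signal, frame_len):
    """Toy STFT: rectangular window, hop size equal to the frame length."""
    last_start = len(signal) - frame_len
    return [dft(signal[i:i + frame_len]) for i in range(0, last_start + 1, frame_len)]


def istft(frames):
    """Inverse of the toy STFT: concatenate the inverse-transformed frames."""
    return [sample for spectrum in frames for sample in idft(spectrum)]


def apply_mask(mask, ref_stft):
    """Element-wise (Hadamard) complex product of mask Q and reference STFT."""
    return [[q * y for q, y in zip(q_row, y_row)] for q_row, y_row in zip(mask, ref_stft)]


# Hypothetical reference-microphone signal: 2 frames of 8 samples each.
y_ref = [math.sin(2 * math.pi * 2 * t / 8) for t in range(16)]
Y_ref = stft(y_ref, frame_len=8)

# A mask of ones keeps every time-frequency bin; a mask of zeros removes all.
keep_all = [[1 + 0j] * 8 for _ in Y_ref]
drop_all = [[0j] * 8 for _ in Y_ref]

t_keep = istft(apply_mask(keep_all, Y_ref))
t_drop = istft(apply_mask(drop_all, Y_ref))
```

In a practical system the mask would come from the network, and the transform would use overlapping windows with overlap-add synthesis; only the element-wise product carries over unchanged from this sketch.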

Abstract

This paper introduces an area-based source separation method designed for virtual meeting scenarios. The aim is to preserve speech signals from an unspecified number of sources within a defined spatial area in front of a linear microphone array, while suppressing all other sounds. To this end, we employ an efficient neural network architecture adapted for multi-channel input to encompass the predefined target area. To evaluate the approach, training data and specific test scenarios, including multiple target and interfering speakers as well as background noise, are simulated. All models are rated according to DNSMOS and scale-invariant signal-to-distortion ratio. Our experiments show that the proposed method separates speech from multiple speakers within the target area well while being of very low complexity, as intended for real-time processing. In addition, a power reduction heatmap is used to demonstrate the networks' ability to identify sources located within the target area. We put our approach in context with a well-established baseline for speaker-speaker separation and discuss its strengths and challenges.
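For readers unfamiliar with the scale-invariant signal-to-distortion ratio (SI-SDR), the following sketch follows the commonly used definition: the estimate is projected onto the target, and the energy of that projection is compared with the energy of the residual. It is an illustration, not the evaluation code used in the paper; the function name and the toy signals are hypothetical, and the mean removal that some implementations apply beforehand is omitted.

```python
import math


def si_sdr(estimate, target):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    target_energy = sum(t * t for t in target)
    if target_energy == 0:
        raise ValueError("target must not be all zeros")
    # Optimal scaling of the target so that it best explains the estimate.
    scale = sum(e * t for e, t in zip(estimate, target)) / target_energy
    projection = [scale * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_power = sum(p * p for p in projection)
    noise_power = sum(r * r for r in residual)
    if noise_power == 0:
        return math.inf  # estimate is a scaled copy of the target
    return 10.0 * math.log10(signal_power / noise_power)


# Toy data: the distortion is orthogonal to the target and 20 dB weaker.
target = [1.0, -1.0, 1.0, -1.0]
distortion = [0.1, 0.1, -0.1, -0.1]
estimate = [t + d for t, d in zip(target, distortion)]

score = si_sdr(estimate, target)
score_rescaled = si_sdr([0.5 * e for e in estimate], target)
```

Rescaling the estimate leaves the score unchanged, which is the property that gives the metric its name.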

Paper Structure (13 sections, 5 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Illustration of the investigated scenario. Speech sources inside the ROI, which spans an angle of $\alpha=60^{\circ}$, are kept, while interfering speakers and noise are suppressed. Speech sources and noise are denoted by X and $\bigstar$, respectively. The area and sources of interest are colored green. The dashed lines bound the mirrored area that results from front-back ambiguity.
  • Figure 2: The network architecture. The input signal in STFT domain is concatenated along the microphone channel dimension and the output $Q$ is a single-channel complex-valued separation mask used to extract the target components.
  • Figure 3: Power reduction (PR) heatmap of an ROI with $\alpha=60^{\circ}$ using the $\text{CRUSE}_{c,l}$ model. Due to the front-back ambiguity of uniform linear arrays (ULAs), only half the room is shown.
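
The ROI geometry and the front-back ambiguity mentioned in the captions of Figures 1 and 3 can be sketched as follows. The coordinate convention is an assumption made for illustration, not taken from the paper: the array lies on the x-axis and is centered at the origin, the broadside direction is +y, and the ROI is a wedge of opening angle $\alpha$ that is symmetric around broadside. All function names are hypothetical.

```python
import math


def angle_from_broadside(x, y):
    """Angle in degrees between the source direction and the broadside (+y)."""
    return math.degrees(math.atan2(x, y))


def in_roi(x, y, alpha_deg=60.0):
    """True if (x, y) lies in the wedge of opening angle alpha in front of the array."""
    return y > 0 and abs(angle_from_broadside(x, y)) <= alpha_deg / 2


def in_roi_or_mirror(x, y, alpha_deg=60.0):
    """True if (x, y) lies in the ROI or in its mirror image behind the array.

    Microphones on the x-axis are equally far from (x, y) and (x, -y), so a
    linear array cannot tell the two positions apart.
    """
    return in_roi(x, abs(y), alpha_deg)


# Hypothetical source positions in metres.
front_center = in_roi(0.0, 1.0)           # on the broadside axis
front_inside = in_roi(0.5, 1.0)           # about 26.6 degrees off broadside
front_outside = in_roi(1.0, 1.0)          # 45 degrees off broadside
behind = in_roi(0.0, -1.0)                # behind the array
behind_mirror = in_roi_or_mirror(0.0, -1.0)
```

Because mirrored positions produce the same inter-microphone distances, plotting only the half-plane in front of the array, as Figure 3 does, loses no information.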