Exploring the Potential of Data-Driven Spatial Audio Enhancement Using a Single-Channel Model

Arthur N. dos Santos, Bruno S. Masiero, Túlio C. L. Mateus

TL;DR

The work investigates whether a data-driven single-channel speech enhancement (SE) model can be applied effectively to multi-channel 3D audio by processing each channel independently. It benchmarks this single-input single-output (SISO) approach against two established multi-channel SE models (FaSNet and MMUB), using direction-of-arrival (DOA) estimation to assess spatial fidelity. The comparison reveals a trade-off: multi-channel methods yield higher intelligibility, measured with the short-time objective intelligibility (STOI) score, but sacrifice spatial cues, while the single-channel approach preserves spatial information with comparatively lower intelligibility gains. The findings suggest that single-channel methods are viable for applications prioritizing spatial awareness (e.g., AR/VR), whereas multi-channel processing remains advantageous when intelligibility matters most; recent generative SISO models also show promise for improving perceptual quality. Overall, preserving inter-channel level and phase differences (ICLD/ICPD) is key when extending single-channel SE to multi-channel data, and future work could explore full-bandwidth enhancement and unsupervised learning to push the state of the art.

Abstract

One key aspect differentiating data-driven single- and multi-channel speech enhancement and dereverberation methods is that both the problem formulation and the complexity of the solutions are considerably more challenging in the latter case. Additionally, with limited computational resources, it is cumbersome to train models that require the management of larger datasets or that have more complex designs. In this scenario, the unverified hypothesis that single-channel methods can be adapted to multi-channel scenarios simply by processing each channel independently holds significant implications: it would boost compatibility between sound scene capture and system input-output formats, while also allowing modern research to focus on other challenging aspects, such as full-bandwidth audio enhancement, competitive noise suppression, and unsupervised learning. This study verifies this hypothesis by comparing the enhancement achieved by a basic single-channel speech enhancement and dereverberation model with that of two multi-channel models tailored to separate clean speech from noisy 3D mixes. A direction of arrival estimation model was used to objectively evaluate the single-channel model's capacity to preserve spatial information by comparing estimates obtained from the output signals with ground-truth coordinate values. Consequently, a trade-off arises: the more straightforward single-channel solution preserves spatial information, but at the cost of lower gains in intelligibility scores.
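
The per-channel idea described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: `enhance_mono` is a hypothetical stand-in (a 3-tap moving average), not the paper's FC U-net, and the inter-channel level difference (ICLD) is simplified to a broadband RMS ratio rather than a per-frequency measure.

```python
import math

def enhance_mono(channel):
    """Hypothetical stand-in for a single-channel enhancer (NOT the FC U-net).

    A 3-tap moving average is used only so the sketch runs without a model.
    """
    n = len(channel)
    out = []
    for i in range(n):
        lo, hi = max(0, i - 1), min(n, i + 2)
        out.append(sum(channel[lo:hi]) / (hi - lo))
    return out

def enhance_multichannel(channels):
    """Apply the same single-channel enhancer to every channel independently."""
    return [enhance_mono(ch) for ch in channels]

def rms(channel):
    """Root-mean-square level of one channel."""
    return math.sqrt(sum(s * s for s in channel) / len(channel))

def icld_db(ch_a, ch_b):
    """Inter-channel level difference in dB (broadband simplification)."""
    return 20.0 * math.log10(rms(ch_a) / rms(ch_b))

# Toy two-channel scene: the second channel is an attenuated copy of the first.
ref = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
scene = [ref, [0.5 * s for s in ref]]
enhanced = enhance_multichannel(scene)

icld_before = icld_db(scene[0], scene[1])
icld_after = icld_db(enhanced[0], enhanced[1])
print(f"ICLD before: {icld_before:.3f} dB, after: {icld_after:.3f} dB")
```

Because the placeholder is linear and identical for every channel, the level difference is unchanged in this toy case. A trained enhancement model is nonlinear and signal-dependent, so no such guarantee holds in general, which is why the study measures spatial preservation empirically through DOA estimation.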

Paper Structure (18 sections, 3 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: The four channels of a noisy reverberant spatial sound scene (a–d), followed by the clean monophonic speech estimated by FaSNet (e) and MMUB (f), the enhancement promoted by the FC U-net for each channel independently (g–j), and a conversion to mono format in the DOA detected by SELDnet. STOI scores are computed with reference to the clean monophonic speech signal $x(t)$.
  • Figure 2: Polar plots for the associated azimuth and elevation angles of the estimated desired sound source's DOA and ground-truth coordinate values.
  • Figure 3: STOI scores for the enhancement promoted by (a) FaSNet and (b) MMUB, using the L3DAS22 test set.
  • Figure 4: STOI scores for the enhancement promoted by the FC U-net, using the test set for Task 1 of L3DAS22.
  • Figure 5: Prediction error between SELDnet's DOA estimations using Mic A and Mic B recordings enhanced by the FC U-net as input and the test set ground-truth coordinate values in (a) Cartesian coordinates and the associated (b) distance (radius) and (c) polar angles.
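
The error quantities named in the Figure 5 caption (Cartesian error, distance, and polar angles) could be computed along the following lines. This is only a sketch under stated assumptions: the function names are hypothetical, and the angle convention (azimuth measured in the x-y plane from the x axis, elevation measured from that plane) is assumed here and may differ from the one used in the paper.

```python
import math

def cartesian_to_spherical(x, y, z):
    """Return (radius, azimuth_deg, elevation_deg); convention is an assumption."""
    radius = math.sqrt(x * x + y * y + z * z)
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(z / radius)) if radius > 0 else 0.0
    return radius, azimuth, elevation

def wrap_angle_deg(angle):
    """Wrap an angle difference to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def doa_errors(estimate, truth):
    """Cartesian distance plus signed radius, azimuth, and elevation errors."""
    r_e, az_e, el_e = cartesian_to_spherical(*estimate)
    r_t, az_t, el_t = cartesian_to_spherical(*truth)
    return {
        "cartesian": math.dist(estimate, truth),
        "radius": r_e - r_t,
        "azimuth": wrap_angle_deg(az_e - az_t),
        "elevation": el_e - el_t,
    }

# Toy example: estimate on the +y axis, ground truth on the +x axis.
errors = doa_errors((0.0, 1.0, 0.0), (1.0, 0.0, 0.0))
print(errors)
```

Wrapping the azimuth difference keeps an estimate at 175 degrees and a ground truth at -175 degrees from being reported as a 350 degree error instead of a 10 degree one.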