Exploring the Potential of Data-Driven Spatial Audio Enhancement Using a Single-Channel Model
Arthur N. dos Santos, Bruno S. Masiero, Túlio C. L. Mateus
TL;DR
This work investigates whether a data-driven single-channel speech enhancement (SE) model can be applied to multi-channel 3D audio by processing each channel independently. It benchmarks this single-input single-output (SISO) approach against two established multi-channel SE models (FaSNet and MMUB), using direction-of-arrival (DOA) estimation to assess spatial fidelity. The comparison reveals a trade-off: the multi-channel methods yield higher intelligibility (STOI) but sacrifice spatial cues, while the single-channel approach preserves spatial information with comparatively lower intelligibility gains. The findings suggest that single-channel methods are viable for applications that prioritize spatial awareness (e.g., AR/VR), whereas multi-channel processing remains advantageous when intelligibility matters most; recent generative SISO models also show promise for improving perceptual quality. Overall, preserving inter-channel cues (ICLD/ICPD) is key when extending single-channel SE to multi-channel data, and future work could explore full-bandwidth enhancement and unsupervised learning.
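To make the inter-channel cues mentioned above concrete, here is a minimal sketch that computes them for one channel pair. It assumes ICLD and ICPD stand for inter-channel level and phase differences per time-frequency bin; the helper names (`stft`, `inter_channel_cues`), the frame sizes, and the synthetic test tone are illustrative choices, not taken from the paper.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Plain Hann-windowed STFT; returns an array of shape (frames, bins)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def inter_channel_cues(ref, other, eps=1e-12):
    """Level difference (dB) and phase difference (rad) of `other` vs `ref`."""
    X, Y = stft(ref), stft(other)
    icld_db = 10.0 * np.log10((np.abs(Y) ** 2 + eps) / (np.abs(X) ** 2 + eps))
    icpd_rad = np.angle(Y * np.conj(X))
    return icld_db, icpd_rad

# Synthetic pair (assumed values): a 1 kHz tone at 16 kHz sampling rate,
# and a copy that is half the amplitude and delayed by 4 samples.
fs, f0, delay = 16000, 1000.0, 4
n = np.arange(4096)
ref = np.sin(2 * np.pi * f0 * n / fs)
other = 0.5 * np.sin(2 * np.pi * f0 * (n - delay) / fs)

icld_db, icpd_rad = inter_channel_cues(ref, other)
k = int(round(f0 * 512 / fs))  # frequency bin of the tone (bin 32)
# At the tone bin: approximately -6.02 dB and -pi/2 rad
print(icld_db[0, k], icpd_rad[0, k])
```

An enhancement method that keeps these two quantities close to their values in the clean multi-channel signal would, under this reading, leave the spatial cues largely intact.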
Abstract
Data-driven single- and multi-channel speech enhancement and dereverberation methods differ in one key respect: both the problem formulation and the solutions are considerably more complex in the multi-channel case. Additionally, with limited computational resources, it is cumbersome to train models that require larger datasets or more complex designs. In this scenario, the unverified hypothesis that single-channel methods can be adapted to multi-channel scenarios simply by processing each channel independently has significant implications: it would boost compatibility between sound scene capture and system input-output formats, while also allowing research to focus on other challenging aspects, such as full-bandwidth audio enhancement, competitive noise suppression, and unsupervised learning. This study tests this hypothesis by comparing the enhancement achieved by a basic single-channel speech enhancement and dereverberation model with that of two multi-channel models tailored to separate clean speech from noisy 3D mixes. A direction of arrival estimation model was used to objectively evaluate each model's capacity to preserve spatial information by comparing estimates obtained from the output signals with ground-truth coordinate values. The results reveal a trade-off: the more straightforward single-channel solution preserves spatial information at the cost of lower gains in intelligibility scores.
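The channel-wise strategy described in the abstract can be sketched as follows. This is only an illustration under stated assumptions: `enhance_per_channel` is a hypothetical wrapper, and `toy_enhancer` is a moving-average placeholder standing in for a trained single-channel model, not the model used in the study.

```python
import numpy as np

def enhance_per_channel(mix, enhance_fn):
    """Apply a single-channel enhancer to every channel independently.

    mix: array of shape (channels, samples).
    enhance_fn: callable mapping a 1-D signal to a 1-D signal of equal length.
    """
    mix = np.asarray(mix, dtype=float)
    if mix.ndim != 2:
        raise ValueError("expected an array of shape (channels, samples)")
    # No information is shared across channels: each one is processed alone.
    return np.stack([enhance_fn(channel) for channel in mix], axis=0)

def toy_enhancer(x, width=5):
    """Placeholder only: a moving-average smoother, NOT a speech enhancer."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

# Hypothetical 4-channel input (e.g., one microphone array), 1 s at 16 kHz.
rng = np.random.default_rng(0)
mix = rng.standard_normal((4, 16000))
enhanced = enhance_per_channel(mix, toy_enhancer)
print(enhanced.shape)
```

Because the same function is applied to every channel, the wrapper itself introduces no cross-channel mixing; whether spatial cues survive then depends on how the single-channel model alters each channel's level and phase.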
