UniArray: Unified Spectral-Spatial Modeling for Array-Geometry-Agnostic Speech Separation

Weiguang Chen, Junjie Zhang, Jielong Yang, Eng Siong Chng, Xionghu Zhong

TL;DR

UniArray addresses array-geometry-agnostic speech separation with a three-component pipeline: Virtual Microphone Estimation (VME), which normalizes channel counts; a spectral-spatial feature extraction and fusion block, which includes a Spatial Dictionary Learning (SDL) module operating at the frequency-bin level; and a hierarchical dual-path separator, which captures dependencies along time and frequency using patch-based grouping and Conformer layers. The method outperforms state-of-the-art baselines in SI-SDRi and perceptual metrics on both seen and unseen microphone geometries. The results also indicate that the VME and SDL components are key to generalization, while the FSDL variant remains useful but is slightly weaker. Overall, UniArray is a geometry-agnostic, computationally efficient approach to multi-channel speech separation that preserves spatial information more effectively than traditional interleaving strategies. The work highlights the practical value of explicit spectral-spatial modeling and virtual-channel augmentation for real-world, ad-hoc microphone arrays.

Abstract

Array-geometry-agnostic speech separation (AGA-SS) aims to develop an effective separation method regardless of the microphone array geometry. Conventional methods rely on permutation-free operations, such as summation or attention mechanisms, to capture spatial information. However, these approaches often incur high computational costs or disrupt the effective use of spatial information during intra- and inter-channel interactions, leading to suboptimal performance. To address these issues, we propose UniArray, a novel approach that abandons the conventional interleaving scheme. UniArray consists of three key components: a virtual microphone estimation (VME) module, a feature extraction and fusion module, and a hierarchical dual-path separator. The VME ensures robust performance across arrays with varying channel numbers. The feature extraction and fusion module leverages a spectral feature extraction module and a spatial dictionary learning (SDL) module to extract and fuse frequency-bin-level features, allowing the separator to focus on using the fused features. The hierarchical dual-path separator models feature dependencies along the time and frequency axes while maintaining computational efficiency. Experimental results show that UniArray outperforms state-of-the-art methods in SI-SDRi, WB-PESQ, NB-PESQ, and STOI across both seen and unseen array geometries.
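
For readers unfamiliar with the headline metric, SI-SDRi is the scale-invariant signal-to-distortion ratio of the separated output minus that of the unprocessed mixture, both measured against the clean target. The sketch below follows the standard definition of the metric and is not taken from the paper's evaluation code. The function names, the mean removal step, and the small `eps` guard are illustrative assumptions.

```python
import math

def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length 1-D signals.

    Illustrative helper (assumption): not the paper's evaluation code.
    """
    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target; alpha is the optimal scaling.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    scaled_target = [alpha * b for b in t]
    residual = [a - s for a, s in zip(e, scaled_target)]
    signal_power = sum(s * s for s in scaled_target)
    residual_power = sum(r * r for r in residual)
    # eps keeps the ratio finite when the estimate is a perfect match.
    return 10.0 * math.log10((signal_power + eps) / (residual_power + eps))

def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)

# Toy signals (assumption): two orthogonal sinusoids stand in for two speakers.
N = 800
target = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interferer = [math.sin(2 * math.pi * 12 * n / N) for n in range(N)]
mixture = [a + b for a, b in zip(target, interferer)]
# A hypothetical separator output that attenuates the interferer tenfold.
estimate = [a + 0.1 * b for a, b in zip(target, interferer)]

improvement_db = si_sdr_improvement(estimate, mixture, target)
```

In this toy setup the interferer is attenuated by a factor of ten in amplitude, so the improvement comes out at roughly 20 dB. Rescaling `estimate` by any nonzero constant leaves the score unchanged, which is the property that makes the metric scale-invariant.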

Paper Structure

This paper contains 11 sections, 2 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: The overall architecture of the proposed UniArray consists of three key modules: VME, feature extraction and fusion, and a hierarchical dual-path separator. The VME module generates virtual microphone signals to augment the number of channels up to the maximum $M$. The feature extraction and fusion module captures both spectral and spatial features at the frequency-bin level. Finally, the hierarchical dual-path separator estimates the clean spectrogram for each speaker.
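
The channel-count contract described in the caption can be sketched as follows: an array with fewer than $M$ microphones is extended with virtual channels until it has exactly $M$, so that later modules always see a fixed-size input. This is only an interface sketch under stated assumptions. In the paper the VME module is learned, whereas the `mean_channel` estimator here is a hypothetical placeholder, and both function names are illustrative.

```python
def pad_to_max_channels(channels, max_channels, estimate_virtual):
    """Return exactly max_channels signals: real ones first, then virtual ones.

    channels: list of equal-length sample lists, one per real microphone.
    estimate_virtual: callable that maps the real channels to one new signal.
    """
    if not 1 <= len(channels) <= max_channels:
        raise ValueError("need between 1 and max_channels real channels")
    padded = [list(ch) for ch in channels]
    while len(padded) < max_channels:
        # Each virtual channel is derived from the real channels only.
        padded.append(estimate_virtual(channels))
    return padded

def mean_channel(channels):
    """Placeholder estimator (assumption): sample-wise mean of the real channels.

    This is NOT the paper's learned VME; it only stands in for it.
    """
    count = len(channels)
    return [sum(samples) / count for samples in zip(*channels)]

# A two-microphone recording padded to a hypothetical maximum of M = 4.
two_mic = [[1.0, 2.0], [3.0, 4.0]]
padded = pad_to_max_channels(two_mic, max_channels=4, estimate_virtual=mean_channel)
```

Keeping the real channels first and appending virtual ones leaves the original recordings untouched, so an array that already has $M$ microphones passes through unchanged.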