VM-UNSSOR: Unsupervised Neural Speech Separation Enhanced by Higher-SNR Virtual Microphone Arrays
Shulin He, Zhong-Qiu Wang
TL;DR
This work tackles unsupervised speech separation under unknown array geometry and with few physical microphones by augmenting the observed mixture with higher-SNR virtual microphones formed via linear spatial demixers. The virtual microphones yield additional mixture-consistency constraints, encapsulated in a VM-specific loss, $\mathcal{L}_{VM} = \alpha \sum_{k\in\mathcal{R}} \mathcal{L}_{MC,k} + \beta \sum_{k\in\mathcal{V}} \mathcal{L}_{MC,k}$, where $\mathcal{R}$ and $\mathcal{V}$ index the real and virtual microphones, and back-projected demixer outputs are used to stabilize training. The method introduces the augmented input set $\mathcal{U}=\mathcal{R}\cup\mathcal{V}$ with $Q = C \cdot P_r$ virtual channels, enabling more robust separation without labeled sources. On SMS-WSJ, VM-UNSSOR achieves 17.1 dB SI-SDR and 18.0 dB SDR with six physical microphones, and 10.7 dB SI-SDR with two physical microphones, a setting in which UNSSOR collapses. These results illustrate practical gains and applicability to rapid in-domain adaptation.
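The weighted combination of real- and virtual-microphone mixture-consistency terms can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the per-channel loss `mc_loss_per_channel` is an assumed stand-in (an L1 distance between a channel's mixture and the sum of estimated source images at that channel), and the function names, array shapes, and default weights are hypothetical.

```python
import numpy as np


def mc_loss_per_channel(mixture, source_images):
    """Simplified mixture-consistency loss for one channel (assumed form).

    mixture:       complex STFT of one channel, shape (T, F)
    source_images: estimated source images at that channel, shape (C, T, F)
    """
    # The estimated source images should add up to the observed mixture.
    residual = mixture - source_images.sum(axis=0)
    return float(np.mean(np.abs(residual.real) + np.abs(residual.imag)))


def vm_loss(mixtures, source_images, real_idx, virtual_idx, alpha=1.0, beta=1.0):
    """Weighted sum of per-channel MC losses over real and virtual channels.

    mixtures:      shape (K, T, F), real and virtual channels stacked
    source_images: shape (K, C, T, F)
    real_idx / virtual_idx: channel indices playing the roles of R and V
    alpha / beta:  weights on the real and virtual terms (illustrative defaults)
    """
    real_term = sum(mc_loss_per_channel(mixtures[k], source_images[k]) for k in real_idx)
    virtual_term = sum(mc_loss_per_channel(mixtures[k], source_images[k]) for k in virtual_idx)
    return alpha * real_term + beta * virtual_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 2 real + 2 virtual channels, 2 sources, 5 frames, 3 frequency bins (toy sizes).
    images = rng.standard_normal((4, 2, 5, 3)) + 1j * rng.standard_normal((4, 2, 5, 3))
    mixtures = images.sum(axis=1)
    print("consistent estimates:", vm_loss(mixtures, images, [0, 1], [2, 3]))
    print("perturbed estimates: ", vm_loss(mixtures, 0.9 * images, [0, 1], [2, 3]))
```

In this toy setup the loss vanishes when the estimated images sum exactly to each channel's mixture and grows once the estimates are perturbed; setting `beta` to zero reduces it to a loss over the real microphones only.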
Abstract
Blind speech separation (BSS) aims to recover multiple speech sources from multi-channel, multi-speaker mixtures under unknown array geometry and room impulse responses. In the unsupervised setup, where clean target speech is not available for model training, UNSSOR proposes a mixture consistency (MC) loss for training deep neural networks (DNNs) on over-determined training mixtures to realize unsupervised speech separation. However, as the number of microphones in the training mixtures decreases, the MC constraint weakens and separation performance degrades sharply. To address this, we propose VM-UNSSOR, which augments the observed training mixtures, recorded by a limited number of microphones, with several higher-SNR virtual-microphone (VM) signals obtained by applying linear spatial demixers (such as IVA and spatial clustering) to the observed training mixtures. As linear projections of the observed mixtures, the virtual-microphone signals can typically increase the SNR of each source, and they can be leveraged to compute extra MC losses that improve UNSSOR and address its frequency permutation problem. On the SMS-WSJ dataset, in the over-determined six-microphone, two-speaker separation setup, VM-UNSSOR reaches 17.1 dB SI-SDR, while UNSSOR obtains only 14.7 dB; in the determined two-microphone, two-speaker case, UNSSOR collapses to -2.7 dB SI-SDR, while VM-UNSSOR achieves 10.7 dB.
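To make "linear projections of the observed mixtures" concrete, the sketch below applies a per-frequency demixing matrix to a multi-channel STFT and stacks the result with the physical channels. It is only an illustration under stated assumptions: the demixer here is a placeholder array rather than the output of IVA or spatial clustering, back-projection is omitted, and the names `apply_demixer`, `augment_with_virtual_mics`, and the shape convention (P physical channels, Q virtual channels) are hypothetical.

```python
import numpy as np


def apply_demixer(mixture_stft, demixer):
    """Apply a per-frequency linear demixer to a multi-channel STFT.

    mixture_stft: complex array, shape (P, T, F), P physical microphones
    demixer:      complex array, shape (F, Q, P), one Q x P matrix per frequency
    Returns the virtual-microphone signals, shape (Q, T, F).
    """
    # For every frame t and frequency f: virtual[:, t, f] = demixer[f] @ mixture[:, t, f]
    return np.einsum("fqp,ptf->qtf", demixer, mixture_stft)


def augment_with_virtual_mics(mixture_stft, demixer):
    """Stack physical and virtual channels into one array, shape (P + Q, T, F)."""
    virtual = apply_demixer(mixture_stft, demixer)
    return np.concatenate([mixture_stft, virtual], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, Q, T, F = 2, 2, 4, 3  # toy sizes, chosen only for the demonstration
    mixture = rng.standard_normal((P, T, F)) + 1j * rng.standard_normal((P, T, F))
    # Placeholder demixer; a real system would estimate it from the mixture.
    demixer = rng.standard_normal((F, Q, P)) + 1j * rng.standard_normal((F, Q, P))
    augmented = augment_with_virtual_mics(mixture, demixer)
    print(augmented.shape)
```

Because each virtual channel is a fixed linear combination of the physical channels in every frequency bin, it adds no new recordings; what it offers is a differently weighted view of the same mixture on which a further mixture-consistency term can be computed.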
