Enhanced Reverberation as Supervision for Unsupervised Speech Separation

Kohei Saijo, Gordon Wichern, François G. Germain, Zexu Pan, Jonathan Le Roux

TL;DR

This work tackles unsupervised training of monaural speech separation models in the determined two-channel setting (as many channels as sources) by extending Reverberation as Supervision (RAS) to Enhanced Reverberation as Supervision (ERAS). ERAS stabilizes training with a high-weight intra-source magnitude scattering (ISMS) loss and improves separation with an inter-channel consistency (ICC) loss, combined as $\mathcal{L}_{\mathrm{ERAS}}^{(m_r \to m)} = \mathcal{L}_{\mathrm{RAS+ISMS}}^{(m_r \to m)} + \gamma \mathcal{L}_{\mathrm{ICC}}^{(m_r \to m)}$. A two-stage training schedule balances stability and performance: the model is first trained with $\beta > 0$ (and optionally $\gamma > 0$), then fine-tuned with $\beta = 0$ while retaining the ICC term. Experiments on WHAMR! and SMS-WSJ show ERAS achieving higher SI-SNR, SDR, and PESQ than RAS/UNSSOR baselines, though still below fully supervised performance, which suggests practical gains for unsupervised separation in reverberant multi-channel conditions.
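The two-stage schedule can be read as a switch on the loss weights. The sketch below assumes that $\mathcal{L}_{\mathrm{RAS+ISMS}}$ expands to $\mathcal{L}_{\mathrm{RAS}} + \beta \mathcal{L}_{\mathrm{ISMS}}$, which this summary does not spell out. The names `LossWeights`, `weights_for_step`, `eras_loss`, and `switch_step` are hypothetical, and no numeric values from the paper are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    beta: float   # weight on the ISMS term (assumed role of beta)
    gamma: float  # weight on the ICC term


def weights_for_step(step, switch_step, beta, gamma, icc_in_stage1=True):
    """Return the loss weights for a given training step.

    Stage 1 (step < switch_step): beta > 0, gamma optional.
    Stage 2 (step >= switch_step): beta = 0, gamma retained.
    """
    if step < switch_step:
        return LossWeights(beta=beta, gamma=gamma if icc_in_stage1 else 0.0)
    return LossWeights(beta=0.0, gamma=gamma)


def eras_loss(l_ras, l_isms, l_icc, w):
    """Combine the three terms, assuming L_RAS+ISMS = L_RAS + beta * L_ISMS."""
    return l_ras + w.beta * l_isms + w.gamma * l_icc
```

In a training loop, `weights_for_step` would be called once per step and its result passed to `eras_loss` together with the three scalar loss values.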

Abstract

Reverberation as supervision (RAS) is a framework that allows for training monaural speech separation models from multi-channel mixtures in an unsupervised manner. In RAS, models are trained so that sources predicted from a mixture at an input channel can be mapped to reconstruct a mixture at a target channel. However, stable unsupervised training has so far only been achieved in over-determined source-channel conditions, leaving the key determined case unsolved. This work proposes enhanced RAS (ERAS) for solving this problem. Through qualitative analysis, we found that stable training can be achieved by leveraging a loss term designed to alleviate the frequency-permutation problem. Separation performance is also boosted by adding a novel loss term where separated signals mapped back to their own input mixture are used as pseudo-targets for the signals separated from other channels and mapped to the same channel. Experimental results demonstrate high stability and performance of ERAS.
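The mapping and the two loss terms described in the abstract can be illustrated with a small NumPy sketch. It rests on several assumptions that the abstract does not confirm: signals are complex STFTs of shape (frames, frequencies), the mapping to a channel is a per-frequency multi-tap filter fitted by regularized least squares against that channel's mixture, both losses are mean absolute errors, and sources separated from different channels come in the same order. The names `map_to_channel`, `ras_loss`, and `icc_loss` are hypothetical.

```python
import numpy as np


def map_to_channel(src, target, taps=4, eps=1e-6):
    """Filter `src` (T, F) per frequency so that it best fits `target` (T, F)."""
    T, F = src.shape
    mapped = np.zeros(target.shape, dtype=complex)
    for f in range(F):
        # Column k holds the source delayed by k frames (zero-padded at the start).
        A = np.zeros((T, taps), dtype=complex)
        for k in range(taps):
            A[k:, k] = src[:T - k, f]
        # Regularized normal equations for the least-squares filter taps.
        gram = A.conj().T @ A + eps * np.eye(taps)
        g = np.linalg.solve(gram, A.conj().T @ target[:, f])
        mapped[:, f] = A @ g
    return mapped


def ras_loss(est_sources, target_mix, taps=4):
    """Reconstruct the target-channel mixture as the sum of mapped sources."""
    mapped = [map_to_channel(s, target_mix, taps) for s in est_sources]
    recon = sum(mapped)
    return float(np.mean(np.abs(target_mix - recon))), mapped


def icc_loss(self_mapped, cross_mapped):
    """Distance between sources mapped to the same channel from two inputs.

    `self_mapped`: sources separated from channel m and mapped back to m,
    used as pseudo-targets. `cross_mapped`: sources separated from the
    other channel and mapped to m. Assumes both lists share source order.
    """
    if len(self_mapped) != len(cross_mapped):
        raise ValueError("both lists must contain the same number of sources")
    per_source = [np.mean(np.abs(p - c)) for p, c in zip(self_mapped, cross_mapped)]
    return float(np.mean(per_source))
```

In an autodiff framework the pseudo-targets would presumably be treated as constants, and a permutation search may be needed if the source order is not aligned across channels. Neither detail is shown here.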

Paper Structure (13 sections, 11 equations, 1 figure, 4 tables)

Figures (1)

  • Figure 1: Overview of ERAS training. Separated signals at the left (L) or right (R) channel are mapped to the opposite channel by relative RIR estimation, and the model is trained to reconstruct mixtures as the sum of the mapped sources (RAS loss). ERAS improves training stability by strongly penalizing undesirable solutions with the ISMS loss, and boosts performance by introducing an inter-channel consistency (ICC) loss that pulls sources mapped to the same channel closer together.