Unsupervised speech dereverberation with a hybrid model
Louis Bahrman, Mathieu Fontaine, Gaël Richard
TL;DR
This work tackles unsupervised speech dereverberation using only reverberant speech and limited acoustic cues such as the reverberation time (RT60). It proposes a hybrid framework in which a dereverberation neural network is trained to produce a dry speech estimate. That estimate is passed through a differentiable reverberation model, built from a synthesized room impulse response (RIR) and applied as a convolution in the time-frequency domain, and the result must reproduce the observed reverberant signal. Training therefore relies on a reverberation-matching loss rather than on paired dry signals. A non-parametric RIR synthesizer and an inter-band convolution in the short-time Fourier transform (STFT) domain enable backpropagation without clean targets, with the RIR parameters re-sampled during training. Experimental results on WSJ1-derived data show that this self-supervised, reverberation-guided approach yields more consistent improvements across SI-SDR, ESTOI, and WB-PESQ than MetricGAN-U, and that the method is robust to errors in RT60 estimation. The authors also release code and pretrained models for reproducibility.
Abstract
This paper introduces a new training strategy to improve speech dereverberation systems in an unsupervised manner using only reverberant speech. Most existing algorithms rely on paired dry/reverberant data, which is difficult to obtain. Our approach uses limited acoustic information, such as the reverberation time (RT60), to train a dereverberation system. Experimental results demonstrate that our method achieves more consistent performance across various objective metrics than the state of the art.
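The idea of training from RT60 alone can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a simple exponentially decaying noise model for the impulse response, a time-domain convolution instead of the inter-band time-frequency convolution, and a plain mean-squared error as the matching loss. The names `synth_rir` and `reverb_matching_loss` are hypothetical.

```python
import numpy as np

def synth_rir(rt60, sr=16000, rng=None):
    """Hypothetical RIR synthesizer: Gaussian noise under an exponential
    envelope that has decayed by 60 dB after rt60 seconds (an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(sr * rt60)                     # RIR length: rt60 seconds
    t = np.arange(n) / sr
    envelope = 10.0 ** (-3.0 * t / rt60)   # amplitude 1e-3 (-60 dB) at t = rt60
    rir = rng.standard_normal(n) * envelope
    return rir / np.linalg.norm(rir)       # unit-energy normalization

def reverb_matching_loss(dry_estimate, reverberant, rir):
    """Re-reverberate the dry estimate and compare it with the observation.
    Plain MSE is used here for simplicity (an assumption)."""
    rereverb = np.convolve(dry_estimate, rir)[: len(reverberant)]
    return float(np.mean((rereverb - reverberant) ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rir = synth_rir(rt60=0.5, rng=rng)
    dry = rng.standard_normal(1000)             # stand-in for dry speech
    wet = np.convolve(dry, rir)[: len(dry)]     # simulated observation
    print(reverb_matching_loss(dry, wet, rir))  # true dry signal
    print(reverb_matching_loss(wet, wet, rir))  # no dereverberation applied
```

In an actual training loop the dry estimate would come from the dereverberation network, the loss would be computed in a differentiable framework, and a fresh impulse response would be sampled at each step. No paired dry signal enters the loss, only the reverberant observation and the RT60 value.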
