A Hybrid Model for Weakly-Supervised Speech Dereverberation
Louis Bahrman, Mathieu Fontaine, Gael Richard
TL;DR
This work addresses data-scarce single-channel speech dereverberation with a reverberation-based weak supervision framework: the dereverberation model is trained so that its dry estimate, once re-reverberated with a synthesized room impulse response (RIR), matches the observed reverberant signal. The method combines a cross-band convolutive model with a reverberation-matching loss, which replaces traditional paired dry/wet targets and avoids optimizing directly for a single target metric. Empirical results show more consistent performance across metrics and greater robustness than metric-based weak supervision, while strong supervision still offers gains for certain architectures. The approach enables effective dereverberation when only limited acoustic information is available, and it could be extended to other domains such as music and to more realistic RIR models.
Abstract
This paper introduces a new training strategy to improve speech dereverberation systems using minimal acoustic information and reverberant (wet) speech. Most existing algorithms rely on paired dry/wet data, which is difficult to obtain, or on target metrics that may not adequately capture reverberation characteristics and can lead to poor results on non-target metrics. Our approach uses limited acoustic information, such as the reverberation time (RT60), to train a dereverberation system. The system's output is resynthesized using a generated room impulse response and compared with the original reverberant speech, which yields a novel reverberation-matching loss that replaces the standard target metrics. During inference, only the trained dereverberation model is used. Experimental results demonstrate that our method achieves more consistent performance across various objective metrics used in speech dereverberation than the state of the art.
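The training signal described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a time-domain convolution in place of the cross-band convolutive model, a squared-error comparison in place of the paper's loss, and a simple RIR made of Gaussian noise with an exponential decay set by RT60. The function names `synth_rir` and `reverberation_matching_loss`, the sampling rate, and all parameter values are hypothetical.

```python
import numpy as np


def synth_rir(rt60, fs=16000, seed=0):
    """Hypothetical RIR generator: Gaussian noise under an exponential decay.

    Assumption: the amplitude envelope falls by 60 dB after rt60 seconds.
    """
    n = int(round(rt60 * fs))
    t = np.arange(n) / fs
    envelope = 10.0 ** (-3.0 * t / rt60)  # 20*log10(envelope) = -60 dB at t = rt60
    rng = np.random.default_rng(seed)
    rir = rng.standard_normal(n) * envelope
    return rir / np.linalg.norm(rir)  # unit-energy normalization (an assumption)


def reverberation_matching_loss(dry_estimate, wet, rir):
    """Re-reverberate the dry estimate and compare it with the observed wet signal."""
    resynth = np.convolve(dry_estimate, rir)[: len(wet)]
    return float(np.mean((resynth - wet) ** 2))  # squared error, for illustration only


# Toy check with synthetic signals (no real speech involved).
rng = np.random.default_rng(1)
dry = rng.standard_normal(4000)          # stand-in for a dry signal
rir = synth_rir(rt60=0.5)                # RIR generated from RT60 alone
wet = np.convolve(dry, rir)[: len(dry)]  # simulated reverberant observation

perfect = reverberation_matching_loss(dry, wet, rir)              # ideal dry estimate
poor = reverberation_matching_loss(np.zeros_like(dry), wet, rir)  # trivial estimate
print(perfect, poor)
```

In this toy setting the loss vanishes when the dry estimate equals the true dry signal and grows as the estimate departs from it, so no dry reference is needed to compute it. In an actual training loop the dry estimate would come from the dereverberation model, and the loss would be written in a differentiable framework.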
