Unsupervised Multi-channel Speech Dereverberation via Diffusion

Yulun Wu; Zhongweiyang Xu; Jianchong Chen; Zhong-Qiu Wang; Romit Roy Choudhury

TL;DR

The paper tackles multi-channel blind speech dereverberation by introducing USD-DPS, an unsupervised framework that uses a strong unconditional diffusion prior for clean speech together with posterior sampling to recover the reference-channel signal from reverberant multi-channel mixtures. It combines a parameterized sub-band RIR model for the reference channel with an analytical, FCP-based estimate of the non-reference RIRs to form a tractable likelihood guidance term within the diffusion sampling process, enforcing multi-channel consistency. Empirical results on WSJ0CAM-DEREVERB show USD-DPS achieving state-of-the-art performance among unsupervised methods and favorable efficiency compared to MC-BUDDy, with competitive results against supervised baselines on several metrics. The work highlights the benefit of jointly leveraging diffusion priors and principled RIR estimation for unsupervised dereverberation in multi-channel array setups, with potential extensions to broader array inverse problems.

Abstract

We consider the problem of multi-channel single-speaker blind dereverberation, where multi-channel mixtures are used to recover the clean anechoic speech. To solve this problem, we propose USD-DPS, Unsupervised Speech Dereverberation via Diffusion Posterior Sampling. USD-DPS uses an unconditional clean speech diffusion model as a strong prior and solves the problem by posterior sampling. At each diffusion sampling step, we estimate the room impulse responses (RIRs) of all microphone channels, which are then used to enforce a multi-channel mixture consistency constraint for diffusion guidance. For multi-channel RIR estimation, we estimate the reference-channel RIR by optimizing the parameters of a sub-band RIR signal model with the Adam optimizer, and we estimate the non-reference channels' RIRs analytically using forward convolutive prediction (FCP). We find that this combination provides a good balance between sampling efficiency and RIR prior modeling, and that it achieves superior performance among unsupervised dereverberation approaches. An audio demo page is available at https://usddps.github.io/USDDPS_demo/.
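To make the analytical step concrete, the following is a minimal sketch of the idea behind FCP-style filter estimation: given a current estimate of the clean speech in the STFT domain and one observed channel, a short multi-frame filter is fitted independently in each frequency band by linear least squares. This is an illustration under simplifying assumptions and not the authors' implementation: the fit is unweighted (FCP as usually described applies a per-bin weighting), and the function names fcp_filter and apply_filter, the array shapes, and the number of taps are all hypothetical.

```python
import numpy as np


def fcp_filter(source_stft, mixture_stft, taps):
    """Fit a per-frequency multi-frame filter by unweighted least squares.

    source_stft, mixture_stft: complex arrays of shape (n_freq, n_frames).
    Returns filters of shape (n_freq, taps) such that
    mixture[f, t] is approximated by sum_k filters[f, k] * source[f, t - k].
    """
    n_freq, n_frames = source_stft.shape
    filters = np.zeros((n_freq, taps), dtype=complex)
    for f in range(n_freq):
        # Column k holds the source delayed by k frames (zero-padded at the start).
        delayed = np.zeros((n_frames, taps), dtype=complex)
        for k in range(taps):
            delayed[k:, k] = source_stft[f, : n_frames - k]
        # Closed-form least-squares solution for this frequency band.
        filters[f], *_ = np.linalg.lstsq(delayed, mixture_stft[f], rcond=None)
    return filters


def apply_filter(source_stft, filters):
    """Convolve the source with the per-frequency filters along the frame axis."""
    n_freq, n_frames = source_stft.shape
    out = np.zeros((n_freq, n_frames), dtype=complex)
    for k in range(filters.shape[1]):
        out[:, k:] += filters[:, k : k + 1] * source_stft[:, : n_frames - k]
    return out
```

In a posterior sampling loop of the kind the abstract describes, a filter fitted this way for each non-reference channel could be applied to the current clean-speech estimate, and the mismatch with the observed channel would then contribute to the mixture consistency term. How that term is weighted and combined with the reference-channel model is specific to the paper and is not reproduced here.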
