Diffusion-based Unsupervised Audio-visual Speech Enhancement

Jean-Eudes Ayilo; Mostafa Sadeghi; Romain Serizel; Xavier Alameda-Pineda

TL;DR

This paper tackles speech enhancement under unseen noise with a diffusion-based unsupervised audio-visual speech enhancement (AVSE) framework. It learns a diffusion prior for clean speech conditioned on lip-video features and couples it with an NMF noise model in an EM-like loop, where a score network approximates the diffusion score $\nabla_{\mathbf{s}_t} \log p_t(\mathbf{s}_t)$. The audio-visual extension fuses visual features from AV-HuBERT through cross-attention, giving a conditional prior $p(\mathbf{s}|\mathbf{v})$. The paper also introduces AV-UDiffSE+, a fast inference variant that uses a Tweedie-based estimate of $\mathbf{s}_0$ and a single reverse step per iteration. Experiments show that the audio-visual model outperforms its audio-only counterpart and generalizes better than a recent supervised AVSE method, while AV-UDiffSE+ offers a better trade-off between inference speed and performance than the previous diffusion-based method.
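To make the Tweedie-based estimate of $\mathbf{s}_0$ more concrete, here is a minimal sketch in Python. It assumes a Gaussian perturbation kernel $\mathbf{s}_t \sim \mathcal{N}(a_t \mathbf{s}_0, \sigma_t^2 \mathbf{I})$, which may differ from the kernel used in the paper, and it works on real-valued vectors rather than complex STFT coefficients. The names `tweedie_estimate`, `gaussian_score`, `a_t`, `sigma_t`, and `tau` are illustrative, and the analytic score of a toy Gaussian prior stands in for the trained score network.

```python
import numpy as np

def tweedie_estimate(s_t, score, a_t, sigma_t):
    """Posterior-mean estimate of s_0, assuming s_t ~ N(a_t * s_0, sigma_t^2 I).

    Tweedie's formula: E[s_0 | s_t] = (s_t + sigma_t^2 * score(s_t)) / a_t.
    """
    return (s_t + sigma_t**2 * score) / a_t

def gaussian_score(s_t, a_t, sigma_t, tau):
    """Analytic score of the marginal p_t for a toy prior s_0 ~ N(0, tau^2 I).

    This replaces the learned score network purely for illustration.
    """
    return -s_t / (a_t**2 * tau**2 + sigma_t**2)

rng = np.random.default_rng(0)
a_t, sigma_t, tau = 0.8, 0.5, 1.3          # hypothetical kernel and prior parameters
s_t = rng.standard_normal(4)               # a toy "diffused" sample

score = gaussian_score(s_t, a_t, sigma_t, tau)
s0_hat = tweedie_estimate(s_t, score, a_t, sigma_t)

# For a Gaussian prior the exact posterior mean is known in closed form,
# so the Tweedie estimate can be checked against it.
expected = a_t * tau**2 / (a_t**2 * tau**2 + sigma_t**2) * s_t
```

With a learned score model, the same formula gives a one-shot estimate of the clean signal from an intermediate diffusion state, which is the kind of shortcut a single-reverse-step scheme can exploit.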

Abstract

This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach that combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. First, the diffusion model is pre-trained on clean speech conditioned on corresponding video data to simulate the speech generative distribution. This pre-trained model is then paired with the NMF-based noise model to estimate clean speech iteratively. Specifically, a diffusion-based posterior sampling approach is implemented within the reverse diffusion process, where after each iteration, a speech estimate is obtained and used to update the noise parameters. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised-generative AVSE method. Additionally, the new inference algorithm offers a better balance between inference speed and performance compared to the previous diffusion-based method. Code and demo available at: https://jeaneudesayilo.github.io/fast_UdiffSE
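The step in which a speech estimate is used to update the noise parameters can be sketched as follows. This is an illustrative sketch only: it assumes the noise variance is modeled as a low-rank product `W @ H` and applies the standard multiplicative updates for the Itakura-Saito divergence to the residual power $|\mathbf{x} - \hat{\mathbf{s}}|^2$. The exact update used in the paper may differ, and the names `nmf_noise_update` and `is_divergence`, the toy shapes, and the stand-in speech estimate are all hypothetical.

```python
import numpy as np

def is_divergence(P, V):
    """Itakura-Saito divergence between a power spectrogram P and a model V."""
    R = P / V
    return float(np.sum(R - np.log(R) - 1.0))

def nmf_noise_update(P, W, H, n_iter=1, eps=1e-12):
    """Standard multiplicative IS-NMF updates fitting W @ H to the power P.

    Returns new arrays; the inputs W and H are not modified in place.
    """
    for _ in range(n_iter):
        V = W @ H + eps
        W = W * (((P / V**2) @ H.T) / ((1.0 / V) @ H.T + eps))
        V = W @ H + eps
        H = H * ((W.T @ (P / V**2)) / (W.T @ (1.0 / V) + eps))
    return W, H

rng = np.random.default_rng(0)
F, T, K = 16, 20, 4                         # toy sizes: frequency bins, frames, NMF rank
x = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))  # toy noisy STFT
s_hat = 0.5 * x                             # stand-in for the current speech estimate

P = np.abs(x - s_hat) ** 2 + 1e-8           # residual power attributed to noise
W0 = rng.random((F, K)) + 0.1               # non-negative initialization
H0 = rng.random((K, T)) + 0.1
W, H = nmf_noise_update(P, W0, H0, n_iter=50)
```

In an iterative scheme of the kind the abstract describes, a call like `nmf_noise_update` would alternate with the posterior sampling step, so the noise model is refit each time the speech estimate changes.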

Paper Structure (9 sections, 8 equations, 1 figure, 1 table, 1 algorithm)

Introduction
Diffusion-based unsupervised SE
Speech generative modeling
Unsupervised speech enhancement
Diffusion-based unsupervised AVSE
Audio-visual speech generative model
Fast inference algorithm
Experiments
Conclusion

Figures (1)

Figure 1: Schematic diagram of the proposed AV-U-Net (score model) architecture.
