Diffusion Gaussian Mixture Audio Denoise

Pu Wang; Junhui Li; Jialu Li; Liangdong Guo; Youshan Zhang

TL;DR

This work tackles audio denoising under non-Gaussian real-world noise by introducing DiffGMM, a diffusion-based framework that replaces the isotropic Gaussian assumption with a learnable Gaussian Mixture Model (GMM) for noise. The reverse diffusion process is used to estimate GMM parameters $\pi_k$, $\mu_k$, and $\Sigma_k$, with a 1D-U-Net extracting features to predict these parameters, and a loss combining an ELBO term with a noise-prediction component. The method achieves state-of-the-art results on VoiceBank-DEMAND and BirdSoundsDenoising, demonstrating strong perceptual and SDR improvements and validating the effectiveness of modeling non-Gaussian noise within diffusion denoising. This approach broadens the applicability of diffusion-based denoising to real-world, non-Gaussian noise scenarios and offers a practical pathway for robust audio enhancement in diverse environments.
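For reference, the mixture density that replaces the single isotropic Gaussian can be written in the standard GMM form. This is the textbook definition, not an equation copied from the paper; the noise symbol $\epsilon$ and the component count $K$ are notational assumptions here.

$$
p(\epsilon) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(\epsilon;\ \mu_k, \Sigma_k\right),
\qquad \pi_k \ge 0,
\qquad \sum_{k=1}^{K} \pi_k = 1 .
$$

With $K = 1$, $\mu_1 = 0$, and $\Sigma_1 = I$, this reduces to the isotropic Gaussian noise used in standard diffusion models.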

Abstract

Recent diffusion models have achieved promising performance in audio-denoising tasks, since their reverse process can recover clean signals. However, real-world noise does not follow a single Gaussian distribution, and its distribution is often unknown. Relying on sampled Gaussian noise therefore limits the scenarios in which these models apply. To overcome these challenges, we propose DiffGMM, a denoising model based on diffusion and Gaussian mixture models. We employ the reverse process to estimate the parameters of the Gaussian mixture model. Given a noisy audio signal, we first apply a 1D-U-Net to extract features and train linear layers to estimate the Gaussian mixture model parameters, which we use to approximate the real noise distribution. The estimated noise is then repeatedly subtracted from the noisy signal to output a clean audio signal. Extensive experimental results demonstrate that the proposed DiffGMM model achieves state-of-the-art performance.
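The pipeline described in the abstract (features, then linear layers, then GMM parameters, then noise subtraction) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `gmm_head`, `sample_gmm`, and `subtract_estimated_noise`, the weight matrices, and the use of scalar per-component standard deviations in place of full $\Sigma_k$ are all assumptions made for illustration, and the feature vector stands in for the output of the 1D-U-Net.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def gmm_head(features, w_pi, w_mu, w_sigma):
    """Hypothetical linear heads mapping a feature vector to K GMM parameters.

    features: shape (D,), assumed to come from a feature extractor.
    w_pi, w_mu, w_sigma: shape (D, K), assumed learnable weights.
    """
    pi = softmax(features @ w_pi)        # mixture weights, non-negative, sum to 1
    mu = features @ w_mu                 # component means
    sigma = np.exp(features @ w_sigma)   # component std devs, kept positive
    return pi, mu, sigma


def sample_gmm(pi, mu, sigma, n, rng):
    """Draw n scalar samples from the 1-D mixture sum_k pi_k N(mu_k, sigma_k^2)."""
    k = rng.choice(len(pi), size=n, p=pi)            # pick a component per sample
    return mu[k] + sigma[k] * rng.standard_normal(n)


def subtract_estimated_noise(noisy, pi, mu, sigma, rng):
    """One illustrative step: subtract GMM-sampled noise from the noisy signal."""
    return noisy - sample_gmm(pi, mu, sigma, noisy.shape[0], rng)
```

In this sketch the softmax keeps the mixture weights on the probability simplex and the exponential keeps the standard deviations positive, which are common choices for parameterizing a mixture; the paper may constrain its parameters differently.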

Paper Structure (14 sections, 17 equations, 3 figures, 3 tables, 2 algorithms)

Introduction
Methods
Problem
Motivation
Preliminary
Reverse Process
Gaussian mixture model
Methodology
Loss Function
Experiments
Datasets
Implementation details
Performance comparisons
Conclusion

Figures (3)

Figure 1: Flowchart of our diffusion Gaussian mixture (DiffGMM) model. We first use a 1D-U-Net to estimate the GMM parameters $\pi_k$, $\mu_k$, and $\Sigma_k$. We then approximate the additive noise distribution ($x_{app\_noisy}$) with the GMM. The real noise is shown as one representation to ease understanding of the GMM approximation. Finally, we repeatedly subtract the estimated additive noise from the noisy audio signal to recover a clean audio signal.
Figure 2: PESQ for different values of $K$.
Figure 3: Five Gaussian distributions obtained by DiffGMM from the original audio. The first five panels show the Gaussian distributions corresponding to classes 1-5, with their parameters $\pi_{k}, \mu_{k}, \Sigma_{k}$ shown below. The sixth panel shows the original noisy signal and the estimated noisy signal. The x-axis is the audio length, and the y-axis is the audio range.
