Conditional Latent Diffusion-Based Speech Enhancement Via Dual Context Learning
Shengkui Zhao, Zexu Pan, Kun Zhou, Yukun Ma, Chong Zhang, Bin Ma
TL;DR
This work tackles speech enhancement by combining a conditional latent diffusion model (cLDM) with dual-context learning (DCL) to model the latent distributions of both speech and noise. A variational autoencoder compresses mel-spectrograms into a low-dimensional latent space so that diffusion runs efficiently, and conditioning on both noisy latents and text prompts via cross-attention guides the restoration. DCL trains the model on both clean speech and background noise distributions, improving generalization to unseen noises and reducing the required number of diffusion steps. Results on LibriSpeech, AudioSet, VoiceBank+DEMAND, and DNS show improved robustness and perceptual quality, with competitive or better metrics and real-time performance compared to existing diffusion baselines. The approach is therefore practical for real-time, in-the-wild speech enhancement with strong out-of-domain generalization.
Abstract
Recently, the application of diffusion probabilistic models has advanced speech enhancement through generative approaches. However, existing diffusion-based methods have focused on the generation process in high-dimensional waveform or spectral domains, leading to increased generation complexity and slower inference speeds. Additionally, these methods have primarily modelled clean speech distributions, with limited exploration of noise distributions, thereby constraining the discriminative capability of diffusion models for speech enhancement. To address these issues, we propose a novel approach that integrates a conditional latent diffusion model (cLDM) with dual-context learning (DCL). Our method utilizes a variational autoencoder (VAE) to compress mel-spectrograms into a low-dimensional latent space. We then apply cLDM to transform the latent representations of both clean speech and background noise into Gaussian noise by the DCL process, and a parameterized model is trained to reverse this process, conditioned on noisy latent representations and text embeddings. By operating in a lower-dimensional space, the latent representations reduce the complexity of the generation process, while the DCL process enhances the model's ability to handle diverse and unseen noise environments. Our experiments demonstrate the strong performance of the proposed approach compared to existing diffusion-based methods, even with fewer iterative steps, and highlight the superior generalization capability of our models to out-of-domain noise datasets (https://github.com/modelscope/ClearerVoice-Studio).
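To make the "transform the latent representations ... into Gaussian noise" step concrete, the following is a minimal sketch of a forward diffusion process applied to a speech latent and a noise latent. It assumes the standard DDPM closed form z_t = sqrt(alpha_bar_t) z_0 + sqrt(1 - alpha_bar_t) eps with a linear beta schedule; the abstract does not state the schedule, the latent shape, or the exact DCL formulation, so the function names, the 8-dimensional latents, and the schedule constants are all illustrative assumptions. The reverse model and its conditioning on noisy latents and text embeddings are omitted.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common DDPM choice, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t, the running product of (1 - beta_s) for s <= t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def forward_diffuse(z0, t, alpha_bars, rng):
    """Sample z_t from q(z_t | z_0) and return it with the noise that was used."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(ab) * z + math.sqrt(1.0 - ab) * e for z, e in zip(z0, eps)]
    return zt, eps


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Hypothetical flattened VAE latents; real latents would come from the encoder.
speech_latent = [rng.uniform(-1.0, 1.0) for _ in range(8)]
noise_latent = [rng.uniform(-1.0, 1.0) for _ in range(8)]

# Dual context (as assumed in this sketch): the same forward process is
# applied to either target, so both end up close to Gaussian noise.
for name, z0 in (("speech", speech_latent), ("noise", noise_latent)):
    zt, eps = forward_diffuse(z0, t=len(betas) - 1, alpha_bars=alpha_bars, rng=rng)
    print(name, "signal weight at final step:", math.sqrt(alpha_bars[-1]))
```

Because alpha_bar_t shrinks toward zero as t grows, the contribution of the original latent vanishes at the final step regardless of whether it encoded speech or noise. A trained reverse model would then be asked to recover the chosen target from that noise, given the conditioning signals.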
