MeanFlowSE: One-Step Generative Speech Enhancement via MeanFlow

Yike Zhu; Boyi Kang; Ziqian Wang; Xingchen Li; Zihan Zhang; Wenjie Li; Longshuai Xiao; Wei Xue; Lei Xie

TL;DR

MeanFlowSE addresses the latency and efficiency challenges of generative speech enhancement by adopting a one-step refinement driven by an average-velocity field and conditioning on SSL representations. The framework combines an SSL encoder, a VAE encoder–decoder, and a DiT-based MeanFlow backbone to directly transform noisy latents into clean latents in a single step, with guidance from high-quality SSL features and the explicit one-step relation $z_0 = z_1 - u(z_1,0,1)$. Experimental results on the Interspeech 2020 DNS Challenge show state-of-the-art perceptual quality and competitive intelligibility, while achieving a low real-time factor and a compact 40.7M parameter footprint. These findings demonstrate MeanFlowSE's practicality for real-time deployment and its potential for edge devices.
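The one-step relation above is the $r=0$, $t=1$ case of the general average-velocity update $z_r = z_t - (t-r)\,u(z_t, r, t)$. The following is a minimal NumPy sketch of that update, not the authors' implementation. The names `meanflow_step`, `one_step_refine`, and `oracle_u` are hypothetical, and `oracle_u` is a toy stand-in for the DiT-based backbone: it assumes a straight-line path between a known clean latent and a noisy latent, so the average velocity is simply their difference.

```python
import numpy as np

def meanflow_step(z_t, r, t, u_fn):
    """Jump from time t back to time r using an average-velocity field u_fn."""
    return z_t - (t - r) * u_fn(z_t, r, t)

def one_step_refine(z_1, u_fn):
    """Special case r=0, t=1, i.e. z_0 = z_1 - u(z_1, 0, 1)."""
    return meanflow_step(z_1, 0.0, 1.0, u_fn)

# Toy data (assumption): random "clean" latents and a perturbed "noisy" version.
rng = np.random.default_rng(0)
z_clean = rng.standard_normal((4, 8))
z_noisy = z_clean + 0.3 * rng.standard_normal((4, 8))

def oracle_u(z_t, r, t):
    # Hypothetical oracle replacing the learned network. On the straight path
    # z_tau = (1 - tau) * z_clean + tau * z_noisy the instantaneous velocity is
    # constant, so the average velocity over any [r, t] equals z_noisy - z_clean.
    return z_noisy - z_clean

# A single function evaluation maps the noisy latent to the clean one.
z_hat = one_step_refine(z_noisy, oracle_u)
```

In the actual framework the oracle would be replaced by the trained backbone conditioned on SSL features, and the refined latent would then be passed through the VAE decoder; both of those stages are omitted here.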

Abstract

Speech enhancement (SE) recovers clean speech from noisy signals and is vital for applications such as telecommunications and automatic speech recognition (ASR). While generative approaches achieve strong perceptual quality, they often rely on multi-step sampling (diffusion or flow matching) or on large language models, which limits real-time deployment. To mitigate these constraints, we present MeanFlowSE, a one-step generative SE framework. It adopts MeanFlow to predict an average-velocity field for one-step latent refinement and conditions the model on self-supervised learning (SSL) representations rather than VAE latents. This design accelerates inference and provides robust acoustic-semantic guidance during training. On the blind and simulated test sets of the Interspeech 2020 DNS Challenge, MeanFlowSE attains state-of-the-art (SOTA) perceptual quality and competitive intelligibility while significantly lowering both the real-time factor (RTF) and the model size compared with recent generative competitors, making it suitable for practical use. The code will be released upon publication at https://github.com/Hello3orld/MeanFlowSE.
