DeepFilterGAN: A Full-band Real-time Speech Enhancement System with GAN-based Stochastic Regeneration

Sanberk Serbest, Tijana Stojkovic, Milos Cernak, Andrew Harper

TL;DR

This work addresses real-time, full-band single-channel speech enhancement by blending predictive and generative approaches through stochastic regeneration. It introduces DeepFilterGAN, a two-stage system where a predictive DeepFilterNet2 first enhances the signal, and a GAN-based second stage (generator inspired by Online SpatialNet, discriminator from MelGAN) refines the output using both the noisy input and the first-stage result as conditioning. The approach achieves low latency (~40 ms) with a compact model (~3.58 million parameters) and shows improved non-intrusive speech quality (NISQA-MOS) over the first stage, with ablations confirming the value of noisy conditioning. Compared to heavier baselines like UNIVERSE++ (107.5M parameters), DeepFilterGAN offers a favorable trade-off between quality and latency, making it suitable for streaming deployment while maintaining strong intelligibility and ASR-related metrics. Future directions include joint end-to-end training of both stages and exploring different second-stage architectures such as Mamba-based components.

Abstract

In this work, we propose a full-band real-time speech enhancement system with GAN-based stochastic regeneration. Predictive models focus on estimating the mean of the target distribution, whereas generative models aim to learn the full distribution. Because they estimate the mean, predictive models may over-suppress, i.e., remove speech content. Prior work has shown that combining a predictive model with a generative one within the stochastic regeneration framework can reduce distortion in the output. We use this framework to obtain a real-time speech enhancement system. With 3.58M parameters and low latency, our lightweight system is designed for real-time streaming. Experiments show that our system improves over the first stage in terms of the NISQA-MOS metric. Finally, through an ablation study, we show the importance of noisy conditioning in our system. We participated in the 2025 Urgent Challenge with our model and later made further improvements.

Paper Structure

This paper contains 11 sections, 4 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: Stochastic regeneration. $y$ is the noisy speech, $z$ is the intermediate enhanced speech and $\hat{x}$ is the final enhanced speech, i.e. the clean speech estimate of the overall system.
  • Figure 2: Our proposed model. $Y(k,f)$ is the STFT of noisy speech, $Z(k,f)$ is the STFT of the intermediate enhanced speech and $\hat{X}(k,f)$ is the STFT of the final enhanced speech, i.e. the clean speech estimate of the overall system.
  • Figure 3: The recovery performance of our proposed system. The area in the green box is removed in the first-stage output. Our system with noisy concatenation recovers part of this segment, while the model without noisy concatenation cannot recover it.
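
The data flow in Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration of the conditioning step only, not the authors' implementation: `first_stage`, `conditioning`, `second_stage`, and `enhance` are hypothetical names, both stages are trivial placeholders rather than DeepFilterNet2 or the GAN generator, and the per-bin channel layout (real and imaginary parts of $Z$, then of $Y$) is an assumption made for the example.

```python
def first_stage(Y):
    # Placeholder for the predictive stage (DeepFilterNet2 in the paper).
    # It only attenuates every bin so that Z has the same shape as Y.
    return [[0.5 * y for y in frame] for frame in Y]


def conditioning(Y, Z, use_noisy=True):
    # Build per-bin input channels for the second stage.
    # Assumed layout: [Re Z, Im Z] plus [Re Y, Im Y] when noisy
    # conditioning is enabled (the "noisy concatenation" of Figure 3).
    feats = []
    for frame_y, frame_z in zip(Y, Z):
        frame = []
        for y, z in zip(frame_y, frame_z):
            channels = [z.real, z.imag]
            if use_noisy:
                channels += [y.real, y.imag]
            frame.append(channels)
        feats.append(frame)
    return feats


def second_stage(feats):
    # Placeholder for the generator: returns the Z channels unchanged.
    # A trained generator would map all channels to the estimate X_hat.
    return [[complex(ch[0], ch[1]) for ch in frame] for frame in feats]


def enhance(Y, use_noisy=True):
    # Two-stage pipeline: Y -> Z -> X_hat.
    Z = first_stage(Y)
    return second_stage(conditioning(Y, Z, use_noisy))


# Toy STFT: 2 time frames x 2 frequency bins.
Y = [[1 + 2j, 3 - 1j], [0.5j, 2 + 0j]]
feats = conditioning(Y, first_stage(Y))
feats_ablation = conditioning(Y, first_stage(Y), use_noisy=False)
X_hat = enhance(Y)
```

With noisy conditioning each bin carries four input channels; in the ablated variant it carries two, so the second stage has no access to anything the first stage removed from $Y$.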