Robust One-step Speech Enhancement via Consistency Distillation

Liang Xu, Longfei Felix Yan, W. Bastiaan Kleijn

TL;DR

This work tackles the latency barrier of diffusion-based speech enhancement by introducing ROSE-CD, Robust One-step Speech Enhancement via Consistency Distillation. By distilling a one-step consistency model from a 30-step teacher and incorporating randomized learning trajectories along with joint time-domain PESQ and SI-SDR losses, the method mitigates teacher bias and enhances robustness. The approach achieves a substantial $54\times$ inference speedup and reaches state-of-the-art perceptual quality on VoiceBank-DEMAND ($\text{PESQ}=3.99$), while showing strong generalization to out-of-domain and real-world noisy data. This combination of speed, robustness, and quality promises practical deployment of diffusion-based SE in real-time systems.

Abstract

Diffusion models have shown strong performance in speech enhancement, but their real-time applicability has been limited by multi-step iterative sampling. Consistency distillation has recently emerged as a promising alternative by distilling a one-step consistency model from a multi-step diffusion-based teacher model. However, distilled consistency models are inherently biased towards the sampling trajectory of the teacher model, making them less robust to noise and prone to inheriting inaccuracies from the teacher model. To address this limitation, we propose ROSE-CD: Robust One-step Speech Enhancement via Consistency Distillation, a novel approach for distilling a one-step consistency model. Specifically, we introduce a randomized learning trajectory to improve the model's robustness to noise. Furthermore, we jointly optimize the one-step model with two time-domain auxiliary losses, enabling it to recover from teacher-induced errors and surpass the teacher model in overall performance. This is the first pure one-step consistency distillation model for diffusion-based speech enhancement, achieving 54 times faster inference speed and superior performance compared to its 30-step teacher model. Experiments on the VoiceBank-DEMAND dataset demonstrate that the proposed model achieves state-of-the-art performance in terms of speech quality. Moreover, its generalization ability is validated on both an out-of-domain dataset and real-world noisy recordings.
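One of the two time-domain auxiliary losses named in the TL;DR is SI-SDR. As a rough illustration, the snippet below computes the standard scale-invariant SDR between an enhanced waveform and a clean reference, and negates it so it can be minimized. This is a NumPy sketch of the textbook definition only: the function names, the `eps` constant, and the mean removal are assumptions, and the weighting against the PESQ loss and the consistency objective is not given in the text above, so it is not shown.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (standard definition).

    `eps` is a hypothetical stabilizer, not a value taken from the paper.
    """
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    # Remove the mean so a DC offset does not count as distortion.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimal scaling.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)

def si_sdr_loss(estimate, reference):
    """Negative SI-SDR, so that lower is better when used as a training loss."""
    return -si_sdr(estimate, reference)
```

Because the reference is rescaled by `alpha` before the residual is formed, multiplying the estimate by a constant gain leaves the score essentially unchanged, which is what makes the measure scale-invariant.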

Paper Structure

This paper contains 20 sections, 13 equations, 1 figure, 4 tables, 1 algorithm.

Figures (1)

  • Figure 1: Overview of the proposed robust consistency distillation (RCD). The thick green line illustrates the PF-ODE trajectory defined by a pre-trained diffusion teacher model. During distillation, given a sampled data point $x_{t_n}$ at time step $t_n$, we first estimate $\hat{x}_{t_{n-1}}^{\phi}$ using a one-step ODE solver. To improve robustness, a random noise perturbation is then applied to obtain a noised variant $\hat{x}_{r,t_{n-1}}^{\phi}$. Finally, the consistency model is trained within this robust consistency distillation range, which is highlighted in orange.
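
The three stages in the Figure 1 caption (one-step ODE solve, random perturbation, consistency training) can be sketched on a toy problem as follows. Everything here is an assumption made for illustration: the Euler solver, the Gaussian form of the perturbation, the `noise_scale` parameter, the squared-error distance, and all function names are hypothetical stand-ins, since the paper's actual solver, perturbation rule, and loss are not given in the text above.

```python
import numpy as np

def euler_ode_step(x_tn, t_n, t_prev, drift):
    """One Euler step of dx/dt = drift(x, t), from t_n back to t_prev.

    Stands in for the teacher's one-step ODE solver (assumed form).
    """
    return x_tn + (t_prev - t_n) * drift(x_tn, t_n)

def perturb(x_hat, noise_scale, rng):
    """Add Gaussian noise to the solver output (assumed perturbation form)."""
    return x_hat + noise_scale * rng.standard_normal(x_hat.shape)

def consistency_loss(student, target_net, x_tn, t_n, x_hat_r, t_prev):
    """Squared-error distance between student and target outputs.

    In real training the target would be treated as a constant (no gradient).
    """
    pred = student(x_tn, t_n)
    target = target_net(x_hat_r, t_prev)
    return float(np.mean((pred - target) ** 2))

def rcd_step(x_tn, t_n, t_prev, drift, student, target_net, noise_scale, rng):
    """One illustrative distillation step: solve, perturb, compare."""
    x_hat = euler_ode_step(x_tn, t_n, t_prev, drift)
    x_hat_r = perturb(x_hat, noise_scale, rng)
    loss = consistency_loss(student, target_net, x_tn, t_n, x_hat_r, t_prev)
    return x_hat, x_hat_r, loss

# Toy setup: the ODE dx/dt = x / t has trajectories x(t) proportional to t,
# so f(x, t) = x / t is constant along each trajectory (a perfect consistency map).
toy_drift = lambda x, t: x / t
toy_model = lambda x, t: x / t
```

With `noise_scale=0.0` the toy model sits exactly on the trajectory, so the loss vanishes; a positive `noise_scale` moves the target input off the trajectory, which is the regime the perturbation is meant to expose the student to.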