Diffusion-based Signal Refiner for Speech Enhancement and Separation

Masato Hirano; Ryosuke Sawata; Naoki Murata; Shusuke Takahashi; Yuki Mitsufuji

TL;DR

This work addresses the mismatch between objective metrics and human perceptual quality in speech processing. It introduces Diffiner, a diffusion post-refiner based on DDRM and trained solely on clean speech. Diffiner refines the output of any preceding speech enhancement (SE) or speech separation (SS) model without retraining for that model, using diffusion-based generation to fill in missing or artifact-laden regions; this improves reference-free perceptual metrics (NISQA, DNSMOS) and human listening results. The authors extend prior Diffiner work to cover both enhancement and separation, proposing inference rules for SE and SS, a sigmoid-based noise design for SS, and a BASIS-inspired shared observation approach. Large-scale experiments and a MUSHRA test show meaningful perceptual gains, with some trade-offs in reference-based metrics. The findings suggest that Diffiner is practical as a universal, modular post-processor for existing speech pipelines, and that blending its output with the preceding model's output gives flexible control over downstream objectives such as ASR or MOS.

Abstract

Although recent speech processing technologies have achieved significant improvements in objective metrics, there still remains a gap in human perceptual quality. This paper proposes Diffiner, a novel solution that utilizes the powerful generative capability of diffusion models' prior distributions to address this fundamental issue. Diffiner leverages the probabilistic generative framework of diffusion models and learns natural prior distributions of clean speech to convert outputs from existing speech processing systems into perceptually natural high-quality audio. In contrast to conventional deterministic approaches, our method simultaneously analyzes both the original degraded speech and the pre-processed speech to accurately identify unnatural artifacts introduced during processing. Then, through the iterative sampling process of the diffusion model, these degraded portions are replaced with perceptually natural and high-quality speech segments. Experimental results indicate that Diffiner can recover a clearer harmonic structure of speech, which is shown to result in improved perceptual quality w.r.t. several metrics as well as in a human listening test. This highlights Diffiner's efficacy as a versatile post-processor for enhancing existing speech processing pipelines.

Paper Structure (26 sections, 22 equations, 11 figures, 3 tables, 1 algorithm)

Introduction
Related work
Non-DNN-based methods
DNN-based methods
Deep generative model-based methods
Background
Linear inverse problem
Learning data distribution by using DDPM
Conditional generation with DDRM
Diffiner: DDRM-based speech refiner
Preliminaries
Refiner for speech enhancement
Refiner for speech separation
Experiments
Speech enhancement
...and 11 more sections

Figures (11)

Figure 1: Outline of Diffiner for speech separation. $M$ tracks are perturbed in parallel with Gaussian noise in the forward process (left). The learned denoiser then iteratively generates the sample in the reverse process (right). The outputs of the preceding separation are used for conditioning (upper). In this way, the generation of each source is integrally guided over the shared DDRM update.
Figure 2: Boxplots comparing the results of MUSHRA subjective listening test for speech enhancement (left) and speech separation (right). Twelve participants took part in the experiment.
Figure 3: Comparison of spectrograms.
Figure 4: Change in reference-based and reference-free metrics, i.e., SI-SDR and NISQA, by changing the blending weight $\xi$. The blended signal $\tilde{\bm{x}}$ was calculated by simply adding the weighted preceding method's output $\bm{x}_0$ and Diffiner's output $\hat{\bm{x}}$, i.e., $\tilde{\bm{x}} = \xi \bm{x}_0 + (1-\xi) \hat{\bm{x}}$.
Figure 5: Word Error Rate evaluation using different ASR models for blended outputs. The horizontal axis corresponds to the blending coefficient $\xi$, ranging from 0.0 to 1.0. The blended signal $\tilde{\bm{x}}$ was obtained as $\tilde{\bm{x}} = \xi \bm{x}_0 + (1-\xi) \hat{\bm{x}}$, where $\bm{x}_0$ and $\hat{\bm{x}}$ are respectively the outputs of Dual-path RNN (DPRNN) and Diffiner.
...and 6 more figures
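
The blending rule quoted in the captions of Figures 4 and 5, $\tilde{\bm{x}} = \xi \bm{x}_0 + (1-\xi) \hat{\bm{x}}$, can be illustrated with a minimal sketch. The function name `blend`, the list-of-floats signal representation, and the toy sample values are assumptions made for illustration only; they are not taken from the paper or its code.

```python
from typing import List, Sequence


def blend(x0: Sequence[float], x_hat: Sequence[float], xi: float) -> List[float]:
    """Sample-wise linear blend: xi * x0 + (1 - xi) * x_hat.

    x0    : output of the preceding SE/SS model (hypothetical toy values here)
    x_hat : output of the refiner for the same utterance
    xi    : blending weight in [0, 1]; xi = 1 returns x0, xi = 0 returns x_hat
    """
    if not 0.0 <= xi <= 1.0:
        raise ValueError("xi must lie in [0, 1]")
    if len(x0) != len(x_hat):
        raise ValueError("signals must have the same length")
    return [xi * a + (1.0 - xi) * b for a, b in zip(x0, x_hat)]


# Toy stand-ins for two time-aligned waveforms (assumed values, not real audio).
preceding_output = [1.0, 2.0, -1.0]
refined_output = [3.0, 4.0, 1.0]

# Sweep the weight over the range shown on the horizontal axis of the figures.
for xi in (0.0, 0.25, 0.5, 0.75, 1.0):
    blended = blend(preceding_output, refined_output, xi)
```

Because the blend is a convex combination per sample, the two signals must be time-aligned and equal in length. The sketch only forms the blended signal; evaluating it with SI-SDR, NISQA, or WER, as in the figures, would need separate tooling.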
