Diffusion-based Signal Refiner for Speech Enhancement and Separation
Masato Hirano, Ryosuke Sawata, Naoki Murata, Shusuke Takahashi, Yuki Mitsufuji
TL;DR
This work addresses the mismatch between objective metrics and human perceptual quality in speech processing by introducing Diffiner, a DDRM-based diffusion post-refiner trained solely on clean speech. Diffiner refines the output of any preceding speech enhancement (SE) or speech separation (SS) model without model-specific retraining: diffusion-based generation fills in missing or artifact-laden regions, which improves perceptual quality as measured by NISQA and DNSMOS and in human listening. The authors extend prior Diffiner work to cover both enhancement and separation by proposing inference rules for SE and SS, a sigmoid-based noise design for SS, and a BASIS-inspired shared-observation approach. Large-scale experiments and a MUSHRA test show meaningful perceptual gains, with some trade-offs in reference-based metrics. The findings point to Diffiner's practical potential as a universal, modular post-processor for existing speech pipelines, with blending strategies offering flexible control over downstream objectives such as ASR or MOS.
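As a rough illustration of the blending idea mentioned above, the sketch below linearly interpolates between the output of the preceding model and the refined output. The function name `blend`, the weight `alpha`, and the linear form itself are assumptions made for illustration; this summary does not specify the blending rule the authors actually use.

```python
def blend(preprocessed, refined, alpha):
    """Linearly interpolate between two equal-length signals.

    Hypothetical helper: alpha = 0.0 returns the pre-processed signal
    unchanged, alpha = 1.0 returns the refined signal.
    """
    if len(preprocessed) != len(refined):
        raise ValueError("signals must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Sample-wise weighted sum of the two signals.
    return [(1.0 - alpha) * p + alpha * r for p, r in zip(preprocessed, refined)]
```

Under this assumed form, a weight near 0 stays close to the preceding model's output, which may suit objectives tied to reference-based metrics, while a weight near 1 favors the refined signal.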
Abstract
Although recent speech processing technologies have achieved significant improvements in objective metrics, a gap in human perceptual quality remains. This paper proposes Diffiner, a novel solution that utilizes the powerful generative capability of diffusion models' prior distributions to address this fundamental issue. Diffiner leverages the probabilistic generative framework of diffusion models and learns natural prior distributions of clean speech to convert outputs from existing speech processing systems into perceptually natural, high-quality audio. In contrast to conventional deterministic approaches, our method simultaneously analyzes both the original degraded speech and the pre-processed speech to accurately identify unnatural artifacts introduced during processing. Then, through the iterative sampling process of the diffusion model, these degraded portions are replaced with perceptually natural and high-quality speech segments. Experimental results indicate that Diffiner can recover a clearer harmonic structure of speech, which results in improved perceptual quality with respect to several metrics as well as in a human listening test. This highlights Diffiner's efficacy as a versatile post-processor for enhancing existing speech processing pipelines.
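To make the idea of comparing the degraded input with the pre-processed output more concrete, here is a minimal sketch that computes the frame-wise energy of their difference. It is only a crude time-domain proxy for where the preceding model changed the signal, not the paper's procedure; the function name `frame_residual_energy` and the parameter `frame_len` are hypothetical.

```python
def frame_residual_energy(degraded, preprocessed, frame_len):
    """Mean squared difference between two signals, per frame.

    Hypothetical helper: a large value marks a frame where the preceding
    model altered the input strongly.
    """
    if len(degraded) != len(preprocessed):
        raise ValueError("signals must have the same length")
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    # What the preceding model removed from (or added to) the input.
    residual = [d - p for d, p in zip(degraded, preprocessed)]
    energies = []
    for start in range(0, len(residual), frame_len):
        frame = residual[start:start + frame_len]  # last frame may be shorter
        energies.append(sum(x * x for x in frame) / len(frame))
    return energies
```

A refiner could, in principle, use such a map to decide which regions to regenerate and which to keep.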
