Unsupervised speech enhancement with spectral kurtosis and double deep priors

Hien Ohnaka; Ryoichi Miyazaki

TL;DR

The paper tackles unsupervised speech enhancement by addressing limitations of traditional deep-prior methods, notably early stopping and distortion trade-offs under environmental noise. It introduces a double deep-prior framework with two DNNs: one targets clean speech spectrograms and the other targets noise, such that their sum matches the observed noisy spectrogram; training employs a spectral-kurtosis loss to separate the components and a reconstruction term to align with the input. Experimental results show robust improvements over DP-based baselines in white Gaussian and environmental noise, and the approach effectively mitigates the early stopping problem, with ablations confirming the critical role of the kurtosis-based losses and the dual-DP design. These findings suggest a practical, unsupervised pathway for robust speech enhancement and point to future extensions in complex-domain processing and dereverberation tasks.

Abstract

This paper proposes an unsupervised DNN-based speech enhancement approach founded on deep priors (DPs). Here, a DP refers to the tendency of DNNs to produce clean speech signals more readily than noise. Conventional DP-based methods typically train a DNN on a noisy speech signal using a random noise feature as input, and stop training at the point where only a clean speech signal has been generated. However, such approaches have difficulty determining the optimal stopping time, degrade in performance under environmental background noise, and suffer from a trade-off between distortion of the clean speech signal and noise reduction performance. To address these challenges, we use two DNNs: one to generate a clean speech signal and the other to generate noise. The combined output of these networks closely approximates the noisy speech signal, and a loss term based on spectral kurtosis is used to separate the noisy speech signal into a clean speech signal and noise. The key advantage of this method is that it circumvents the trade-off and the early stopping problem, because the signal is decomposed once training has run for a sufficient number of steps. Through evaluation experiments, we demonstrate that the proposed method outperforms conventional methods under both white Gaussian and environmental noise while effectively mitigating the early stopping problem.
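The objective described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the kurtosis definition (fourth central moment over squared variance, taken per frequency bin across time frames), the mean-squared reconstruction term, and the names `spectral_kurtosis`, `double_prior_loss`, and `weight` are all assumptions made for illustration. The idea is only that the two estimates should sum to the noisy magnitude spectrogram while the speech estimate has high kurtosis and the noise estimate has low kurtosis.

```python
import numpy as np


def spectral_kurtosis(mag, eps=1e-8):
    """Kurtosis of each frequency bin, computed over time frames.

    mag: array of shape (freq_bins, frames), a magnitude spectrogram.
    Assumed definition: fourth central moment / variance squared.
    """
    centered = mag - mag.mean(axis=1, keepdims=True)
    var = (centered ** 2).mean(axis=1)
    fourth = (centered ** 4).mean(axis=1)
    return fourth / (var ** 2 + eps)


def double_prior_loss(speech_hat, noise_hat, noisy_mag, weight=0.1):
    """Hypothetical combined loss for the two network outputs.

    speech_hat, noise_hat: outputs of the speech and noise networks.
    noisy_mag: observed noisy magnitude spectrogram.
    weight: assumed trade-off between the two terms.
    """
    # Reconstruction: the sum of both estimates should match the input.
    recon = np.mean((speech_hat + noise_hat - noisy_mag) ** 2)
    # Separation: reward high kurtosis for speech, low kurtosis for noise.
    kurt = spectral_kurtosis(noise_hat).mean() - spectral_kurtosis(speech_hat).mean()
    return recon + weight * kurt
```

In an actual training loop the two estimates would come from the two DNNs and the loss would be minimized with an automatic-differentiation framework; the NumPy version only shows how the terms fit together.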

Paper Structure (15 sections, 18 equations, 12 figures, 2 tables)

This paper contains 15 sections, 18 equations, 12 figures, 2 tables.

Introduction
Problem setting
Speech signal formulation
DP-based speech enhancement
Motivation
Proposed method
Overview
Design of deep priors
Loss term based on spectral kurtosis
Experimental evaluation
Experimental conditions
Evaluation results
Addressing the early stopping problem
Ablation study
Conclusion

Figures (12)

Figure 1: Conceptual diagram of image denoising via deep image prior (DIP).
Figure 2: A graph of the PESQ score at each step.
Figure 3: Spectrogram of training results for noisy speech with environmental noise using DDU-net.
Figure 4: Concept of the proposed method. (a) DNNs are trained so that the sum of $g_{\theta_1}(\bm{Z}_1)$ and $g_{\theta_2}(\bm{Z}_2)$ approaches $|\bm{X}|$, the kurtosis of $|\hat{\bm{S}}|$ is high, and the kurtosis of $|\hat{\bm{N}}|$ is low. (b) After a sufficient number of iterations $t_o$, $M$ predicted clean speech signals are obtained. (c) The final result is the batch average of the predicted clean speech signals.
Figure 5: Graph of softplus function.
...and 7 more figures
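Two operations named in the captions above, the softplus function (Figure 5) and the batch average of the $M$ predicted clean speech signals (Figure 4c), can be sketched as follows. The function names and the assumption that the $M$ predictions are stacked along the first array axis are illustrative; where softplus enters the paper's pipeline is not stated in this listing.

```python
import numpy as np


def softplus(x):
    """Softplus, log(1 + exp(x)), computed in a numerically stable way."""
    return np.logaddexp(0.0, x)


def batch_average(predictions):
    """Average M predicted spectrograms into one final estimate.

    predictions: array of shape (M, freq_bins, frames) (assumed layout).
    """
    return predictions.mean(axis=0)
```

Softplus maps any real input to a positive value, which is one plausible reason to use it when a network must output magnitudes.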
