Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech

Szu-Wei Fu; Kuo-Hsuan Hung; Yu Tsao; Yu-Chiang Frank Wang

TL;DR

The paper addresses label-free speech quality assessment and enhancement. It introduces VQScore, a self-supervised metric derived from the quantization error of a VQ-VAE trained only on clean speech. The code-space variant is the frame-averaged cosine similarity $VQScore_{(cos,z)} = \frac{1}{T} \sum_{t=1}^{T} \cos(Z_t, Z_{q_t})$, where $Z_t$ is the encoder output at frame $t$, $Z_{q_t}$ is its quantized counterpart, and $T$ is the number of frames. The framework is then extended to self-supervised speech enhancement (SE) through self-distillation combined with adversarial training. The VQ-VAE is trained with the three-term loss $L = dist(X, \hat{X}) + ||sg(Z_t) - Z_{q_t}||_{2} + \beta ||Z_t - sg(Z_{q_t})||_{2}$, where $sg(\cdot)$ is the stop-gradient operator and $\beta$ weights the last term. The model consists of an encoder $E$, a vector quantizer $Q$ with codebook $C$, and a decoder $D$, with Transformer blocks placed around the quantizer and the codebook updated by exponential moving average (EMA). Empirical results show that VQScore is competitive with supervised baselines for quality estimation, and that the self-supervised SE model with adversarial training achieves strong results, particularly under domain mismatch. Together these results point to the practical potential of label-free speech quality assessment and enhancement.
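To make the three terms of the loss concrete, here is a minimal NumPy sketch that evaluates its forward value only. It is not the authors' implementation. Several choices are assumptions made for illustration: `dist` is taken to be mean squared error, the norm is taken over the whole frame matrix, the quantizer picks the nearest code vector by Euclidean distance, and `beta=0.25` is a placeholder rather than a value from the paper. The function names `quantize` and `vq_vae_loss_value` are hypothetical.

```python
import numpy as np


def quantize(z, codebook):
    """Map each frame in z (T, d) to its nearest row of codebook (K, d).

    Nearest is measured by Euclidean distance (an assumption of this sketch).
    """
    # Squared distance between every frame and every code vector: shape (T, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx


def vq_vae_loss_value(x, x_hat, z, z_q, beta=0.25):
    """Forward value of dist(X, X_hat) + ||sg(Z) - Z_q||_2 + beta * ||Z - sg(Z_q)||_2.

    sg(.) acts as the identity in the forward pass and only blocks gradients,
    so it does not appear in this value-only computation.
    """
    recon = np.mean((x - x_hat) ** 2)       # dist(X, X_hat), assumed to be MSE
    quant_err = np.linalg.norm(z - z_q)     # L2 norm over all frames and dimensions
    codebook_term = quant_err               # gradient would reach the codebook only
    commit_term = beta * quant_err          # gradient would reach the encoder only
    return float(recon + codebook_term + commit_term)


# Toy example: two 2-D frames and a two-entry codebook.
codebook = np.array([[1.0, 0.0], [0.0, 1.0]])
z = np.array([[0.9, 0.1], [0.2, 0.8]])
z_q, idx = quantize(z, codebook)
x = np.zeros(4)
loss = vq_vae_loss_value(x, x, z, z_q)
```

The second and third terms have the same value in the forward pass and differ only in where the gradient flows, which is why an autodiff framework with a real stop-gradient operation would be needed for actual training.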

Abstract

Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, for which label collection is time-consuming and expensive. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE). The VQ-VAE is trained on clean speech only; hence, large quantization errors can be expected when the speech is distorted. To further improve correlation with real quality scores, domain knowledge of speech processing is incorporated into the model design. We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training. To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced. In summary, the proposed speech quality estimation method and enhancement models require only clean speech for training without any label requirements. Experimental results show that the proposed VQScore and enhancement model are competitive with supervised baselines. The code will be released after publication.
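The premise that distorted input yields a larger quantization error, and therefore a lower score, can be illustrated on synthetic vectors. The sketch below is a toy illustration under stated assumptions, not the paper's pipeline. A random matrix stands in for a codebook learned from clean speech, "clean" frames are code vectors with small perturbations, and "distorted" frames add larger Gaussian noise. The function name `vqscore_cos` is hypothetical, and selecting the code vector by maximum cosine similarity is a simplification.

```python
import numpy as np


def vqscore_cos(z, codebook):
    """Mean over frames of the cosine similarity between z_t and its chosen code vector.

    The code vector is chosen by maximum cosine similarity (an assumption of
    this sketch), so the score is the frame average of the row-wise maxima.
    """
    eps = 1e-8  # guards against division by zero for all-zero vectors
    z_n = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    c_n = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + eps)
    sim = z_n @ c_n.T  # (T, K) cosine similarities
    return float(sim.max(axis=1).mean())


rng = np.random.default_rng(0)

# Stand-in for a codebook learned from clean speech: 16 codes of dimension 8.
codebook = rng.normal(size=(16, 8))

# "Clean" frames lie close to code vectors; "distorted" frames add heavy noise.
clean = codebook[rng.integers(0, 16, size=50)] + 0.05 * rng.normal(size=(50, 8))
distorted = clean + 1.0 * rng.normal(size=(50, 8))

score_clean = vqscore_cos(clean, codebook)
score_distorted = vqscore_cos(distorted, codebook)
print(f"clean: {score_clean:.3f}  distorted: {score_distorted:.3f}")
```

In this toy setting the clean frames score close to 1 and the distorted frames score noticeably lower, which mirrors the intuition behind using quantization error as a quality proxy.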

Paper Structure (28 sections, 8 equations, 9 figures, 13 tables)

Introduction
Method
Motivation
Proposed Model Framework
Training Objective
VQScore for Speech Quality Estimation
Self-Distillation with Adversarial Training to Improve Model Robustness for Speech Enhancement
Experiments
Test Sets and Baselines for Speech Quality Estimation
Experimental Results of Speech Quality Estimation
Test Sets and Baselines for Speech Enhancement
Experimental Results of Speech Enhancement
Speech Enhancement Results of Matched and Mismatched Conditions
Results of listening test
Conclusion
...and 13 more sections

Figures (9)

Figure 1: Proposed VQ-VAE for self-supervised speech quality estimation and enhancement. The Transformer blocks are only used for speech enhancement.
Figure 2: Learning curves of the correlation coefficient between various objective metrics and the proposed VQScore$_{(cos, z)}$ on the VoiceBank-DEMAND noisy test set (Valentini-Botinhao et al., 2016).
Figure 3: Scatter plots between various objective metrics and the proposed VQScore$_{(cos, z)}$ on the VoiceBank-DEMAND noisy test set. (a) SIG, (b) BAK, (c) OVR, and (d) PESQ.
Figure 4: Scatter plots between real subjective quality scores and the proposed VQScore$_{(cos, z)}$ on (a) IUB_cosine, (b) IUB_voices, (c) Tencent_woR, and (d) Tencent_wR.
Figure 5: Example spectrograms with their corresponding frame-level SNR and predicted frame-level quality. (c) and (d) show the frame-level SNR. (e) and (f) show our predicted frame-level quality.
...and 4 more figures
