Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech

Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang

TL;DR

The paper tackles label-efficient speech quality assessment and enhancement by introducing VQScore, a self-supervised metric derived from the quantization error of a VQ-VAE trained only on clean speech. The code-space cosine-similarity variant is $\mathrm{VQScore}_{(cos,z)} = \frac{1}{T} \sum_{t=1}^{T} \cos(Z_t, Z_{q_t})$, where $Z_t$ is the encoder output at frame $t$, $Z_{q_t}$ is its quantized counterpart, and $T$ is the number of frames. The paper then extends the framework to self-supervised speech enhancement (SE) via self-distillation combined with adversarial training, guided by the three-term loss $L = dist(X, \hat{X}) + \|sg(Z_t) - Z_{q_t}\|_{2} + \beta \|Z_t - sg(Z_{q_t})\|_{2}$, where $sg(\cdot)$ denotes the stop-gradient operator. The model consists of an Encoder E, a Vector Quantizer Q with codebook C, and a Decoder D, with Transformer blocks around the VQ and EMA-updated codebooks. Empirical results show that VQScore is competitive with supervised baselines for quality estimation and that the self-supervised SE model with adversarial training achieves strong results, particularly under domain mismatch, highlighting the practical potential of label-free speech quality assessment and enhancement.
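The cosine-similarity score above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the toy codebook, the two-dimensional frames, and the nearest-neighbour search by squared Euclidean distance are all assumptions made for illustration, and the helper names (`cosine`, `quantize`, `vqscore_cos_z`) are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # small epsilon avoids division by zero

def quantize(z, codebook):
    """Pick the codebook entry closest to z (assumption: squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(z, c)))

def vqscore_cos_z(frames, codebook):
    """Mean over frames of cos(Z_t, Z_{q_t}), following the formula in the text."""
    sims = [cosine(z, quantize(z, codebook)) for z in frames]
    return sum(sims) / len(sims)

# Toy example (hypothetical values): a codebook standing in for clean-speech codes.
codebook = [[1.0, 0.0], [0.0, 1.0]]
clean_like = [[0.9, 0.1], [0.1, 0.9]]      # frames close to codebook entries
distorted_like = [[0.6, 0.5], [0.5, 0.6]]  # frames far from any entry

score_clean = vqscore_cos_z(clean_like, codebook)
score_distorted = vqscore_cos_z(distorted_like, codebook)
print(score_clean > score_distorted)
```

In this toy setting, frames that sit near a codebook entry score higher than frames that fall between entries, which mirrors the intuition that distorted speech yields larger quantization error.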

Abstract

Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized-variational autoencoder (VQ-VAE). The training of VQ-VAE relies on clean speech; hence, large quantization errors can be expected when the speech is distorted. To further improve correlation with real quality scores, domain knowledge of speech processing is incorporated into the model design. We found that the vector quantization mechanism could also be used for self-supervised speech enhancement (SE) model training. To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced. In summary, the proposed speech quality estimation method and enhancement models require only clean speech for training without any label requirements. Experimental results show that the proposed VQScore and enhancement model are competitive with supervised baselines. The code will be released after publication.
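To make the three-term loss from the TL;DR concrete, the sketch below evaluates its forward value for a single frame. Several choices here are assumptions rather than details taken from the paper: `dist` is taken to be a mean absolute error, `beta` defaults to 0.25, and the function name `vqvae_loss_value` is hypothetical. The stop-gradient operator `sg` only changes which parameters receive gradients, so it does not affect the forward value computed here.

```python
import math

def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def vqvae_loss_value(x, x_hat, z, z_q, beta=0.25):
    """Forward value of dist(X, X_hat) + ||sg(Z) - Z_q||_2 + beta * ||Z - sg(Z_q)||_2.

    Assumptions: dist is mean absolute error; beta = 0.25 is an illustrative default.
    """
    recon = sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)  # reconstruction term
    codebook_term = l2_distance(z, z_q)       # in training, gradients reach only Z_q
    commit_term = beta * l2_distance(z, z_q)  # in training, gradients reach only Z
    return recon + codebook_term + commit_term

# Toy frame (hypothetical values): perfect reconstruction, quantization error of 5.
loss = vqvae_loss_value(x=[1.0, 2.0], x_hat=[1.0, 2.0], z=[0.0, 0.0], z_q=[3.0, 4.0])
print(loss)
```

In an autograd framework the two quantization terms would differ only in where the stop-gradient is placed, which is why they have the same forward value in this sketch.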

Paper Structure

This paper contains 28 sections, 8 equations, 9 figures, and 13 tables.

Figures (9)

  • Figure 1: Proposed VQ-VAE for self-supervised speech quality estimation and enhancement. The Transformer blocks are only used for speech enhancement.
  • Figure 2: Learning curves of the correlation coefficient between various objective metrics and the proposed VQScore$_{(cos, z)}$ on the VoiceBank-DEMAND noisy test set (Valentini-Botinhao et al., 2016).
  • Figure 3: Scatter plots between various objective metrics and the proposed VQScore$_{(cos, z)}$ on the VoiceBank-DEMAND noisy test set. (a) SIG, (b) BAK, (c) OVR, and (d) PESQ.
  • Figure 4: Scatter plots between real subjective quality scores and the proposed VQScore$_{(cos, z)}$ on (a) IUB_cosine, (b) IUB_voices, (c) Tencent_woR, and (d) Tencent_wR.
  • Figure 5: Examples of spectrogram, its corresponding frame-level SNR and the predicted frame-level quality. (c) and (d) are the frame-level SNR. (e) and (f) are our predicted frame-level quality.
  • ...and 4 more figures