Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech
Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang
TL;DR
The paper tackles label-efficient speech quality assessment and enhancement by introducing VQScore, a self-supervised metric derived from the quantization error of a VQ-VAE trained only on clean speech. Its code-space variant is the frame-averaged cosine similarity between the encoder output $Z_t$ and its quantized counterpart $Z_{q_t}$: $\mathrm{VQScore}_{(cos,z)} = \frac{1}{T} \sum_{t=1}^{T} \cos(Z_t, Z_{q_t})$. The framework is then extended to self-supervised speech enhancement (SE) through self-distillation combined with adversarial training, with training governed by the three-term loss $L = \mathrm{dist}(X, \hat{X}) + \lVert \mathrm{sg}(Z_t) - Z_{q_t} \rVert_{2} + \beta \lVert Z_t - \mathrm{sg}(Z_{q_t}) \rVert_{2}$, where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator. The model consists of an encoder E, a vector quantizer Q with codebook C, and a decoder D, with Transformer blocks placed around the quantizer and EMA-updated codebooks. Empirical results show that VQScore is competitive with supervised baselines for quality estimation and that the self-supervised SE model with adversarial training achieves strong results, particularly under domain mismatch, which highlights the practical potential of label-free speech quality assessment and enhancement.
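The code-space score above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `nearest_codes` and `vqscore_cos_z`, the `eps` guard, and the choice of Euclidean nearest-neighbour lookup are assumptions made for illustration, and the encoder outputs are stood in for by plain arrays.

```python
import numpy as np


def nearest_codes(z, codebook):
    """Map each frame in z (T, d) to its closest row of codebook (K, d).

    Assumption: the closest code is chosen by Euclidean distance.
    """
    # Squared distances between every frame and every code, shape (T, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return codebook[d2.argmin(axis=1)]


def vqscore_cos_z(z, codebook, eps=1e-8):
    """Frame-averaged cosine similarity between z and its quantized version."""
    zq = nearest_codes(z, codebook)
    dot = (z * zq).sum(axis=1)
    # eps (hypothetical) guards against division by zero for all-zero frames.
    norms = np.linalg.norm(z, axis=1) * np.linalg.norm(zq, axis=1) + eps
    return float((dot / norms).mean())


# Toy data: a two-entry codebook and two sets of "encoder outputs".
codebook = np.array([[1.0, 0.0], [0.0, 1.0]])
z_clean = codebook.copy()                      # frames lie exactly on codes
z_noisy = np.array([[1.0, 0.5], [0.5, 1.0]])   # frames pushed off the codes

score_clean = vqscore_cos_z(z_clean, codebook)
score_noisy = vqscore_cos_z(z_noisy, codebook)
```

In this toy setting, frames that coincide with codebook entries score close to 1, while frames displaced from the codebook score lower, which mirrors the intuition that distorted speech yields larger quantization error.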
Abstract
Speech quality estimation has recently undergone a paradigm shift from expert designs based on human hearing to machine-learning models. However, current models rely mainly on supervised learning, which requires time-consuming and expensive label collection. To solve this problem, we propose VQScore, a self-supervised metric for evaluating speech based on the quantization error of a vector-quantized variational autoencoder (VQ-VAE). The VQ-VAE is trained on clean speech only; hence, large quantization errors can be expected when the speech is distorted. To further improve correlation with real quality scores, domain knowledge of speech processing is incorporated into the model design. We found that the vector quantization mechanism can also be used for self-supervised speech enhancement (SE) model training. To improve the robustness of the encoder for SE, a novel self-distillation mechanism combined with adversarial training is introduced. In summary, the proposed speech quality estimation method and enhancement models require only clean speech for training and no labels. Experimental results show that the proposed VQScore and enhancement model are competitive with supervised baselines. The code will be released after publication.
