Deep Learning-based Non-Intrusive Multi-Objective Speech Assessment Model with Cross-Domain Features

Ryandhimas E. Zezario, Szu-Wei Fu, Fei Chen, Chiou-Shann Fuh, Hsin-Min Wang, Yu Tsao

TL;DR

MOSA-Net is a cross-domain, multi-task speech assessment framework that fuses spectral, learned-filter, and self-supervised learning (SSL) representations to predict PESQ, STOI, and the speech distortion index (SDI) at both the frame and utterance levels. Cross-domain features and multi-task learning improve prediction accuracy and generalization, and a MOSA-Net pre-trained on objective metrics can be adapted to predict subjective human ratings. The latent representations from MOSA-Net are further used to build QIA-SE, a quality-intelligibility-aware speech enhancement (SE) system that outperforms prior model-selection approaches with reduced online computation. Overall, the approach offers a practical route to joint assessment and enhancement of speech quality and intelligibility in diverse noisy conditions.

Abstract

In this study, we propose a cross-domain multi-objective speech assessment model called MOSA-Net, which can estimate multiple speech assessment metrics simultaneously. In perceptual evaluation of speech quality (PESQ) prediction, MOSA-Net improves the linear correlation coefficient (LCC) by 0.026 (0.990 vs 0.964) in seen noise environments and by 0.012 (0.969 vs 0.957) in unseen noise environments, compared to Quality-Net, an existing single-task model for PESQ prediction. In short-time objective intelligibility (STOI) prediction, it improves LCC by 0.021 (0.985 vs 0.964) in seen noise environments and by 0.047 (0.836 vs 0.789) in unseen noise environments, compared to STOI-Net (based on CRNN), an existing single-task model for STOI prediction. Moreover, MOSA-Net, originally trained to assess objective scores, can serve as a pre-trained model and be adapted, with a limited amount of training data, to predict subjective quality and intelligibility scores. In mean opinion score (MOS) prediction, MOSA-Net improves LCC by 0.018 (0.805 vs 0.787) compared to MOS-SSL, a strong single-task model for MOS prediction. Given this confirmed prediction capability, we further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach. Experimental results show that QIA-SE provides superior enhancement performance compared with the baseline SE system in terms of objective evaluation metrics and a qualitative evaluation test. For example, QIA-SE improves PESQ by 0.301 (2.953 vs 2.652) in seen noise environments and by 0.180 (2.658 vs 2.478) in unseen noise environments over a CNN-based baseline SE model.
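
The prediction results above are reported as the linear correlation coefficient (LCC), that is, the Pearson correlation between predicted and ground-truth scores. As a minimal sketch of how such a figure could be computed, the snippet below implements LCC with the Python standard library. The function name `lcc` and the sample scores are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
import math


def lcc(predicted, reference):
    """Pearson linear correlation coefficient between two score sequences."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_r = sum(reference) / len(reference)
    # Covariance and variances are left unnormalized; the 1/N factors cancel.
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("LCC is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_r)


# Hypothetical utterance-level scores, for illustration only (not from the paper).
true_pesq = [1.2, 2.0, 2.8, 3.5, 4.1]
pred_pesq = [1.4, 1.9, 2.9, 3.3, 4.0]
print(f"LCC = {lcc(pred_pesq, true_pesq):.3f}")
```

An LCC close to 1 means the predicted scores track the reference scores almost linearly, which is why differences as small as 0.012 are still reported when both systems already score above 0.95.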

Paper Structure

This paper contains 21 sections, 3 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Architecture of the MOSA-Net model.
  • Figure 2: Architecture of the QIA-SE model.
  • Figure 3: Scatter plots of speech assessment predictions of MOSA-Net, Quality-Net [49], and STOI-Net [52].
  • Figure 4: Scatter plots of speech assessment predictions of the single-task and multi-task MOSA-Net models.
  • Figure 5: Latent representations of a speech utterance at the attention layer of the single-task MOSA-Net models for (a) PESQ, (b) STOI, and (c) SDI. The horizontal and vertical axes denote the frame index and attention weight, respectively.
  • ...and 8 more figures