A Study on Incorporating Whisper for Robust Speech Assessment

Ryandhimas E. Zezario; Yu-Wen Chen; Szu-Wei Fu; Yu Tsao; Hsin-Min Wang; Chiou-Shann Fuh

TL;DR

The paper addresses robust, non-intrusive speech quality and intelligibility assessment by integrating Whisper embeddings into MOSA-Net+, a multi-task CNN-BLSTM framework. MOSA-Net+ fuses cross-domain features: power spectral (PS) features, learnable filter bank (LFB) features, and Whisper embeddings, with the Whisper branch kept frozen and connected through an adapter. Training is guided by a joint objective $L_{All} = \gamma_{1} L_{Quality} + \gamma_{2} L_{Intelligibility}$. Experimental results on TMHINT-QI and the VoiceMOS Challenge 2023 show that Whisper-based features improve prediction accuracy over HuBERT, W2V, and MMS, while combining Whisper with SSL features yields only marginal gains. Whisper-based MOSA-Net+ also achieves the top-ranked performance in the noisy-and-enhanced track of the VoiceMOS Challenge 2023. These findings highlight Whisper’s potential to provide robust acoustic representations for subjective speech assessment and suggest practical pathways for deploying accurate, non-intrusive metrics in real-world scenarios.
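The joint objective in the TL;DR can be illustrated with a minimal sketch. It assumes that each task loss is a mean squared error between predicted and ground-truth scores, which this summary does not state; the function names `mse` and `joint_loss` and the default weights are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length score sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    quality_pred: Sequence[float],
    quality_true: Sequence[float],
    intell_pred: Sequence[float],
    intell_true: Sequence[float],
    gamma1: float = 1.0,  # weight on the quality loss (assumed value)
    gamma2: float = 1.0,  # weight on the intelligibility loss (assumed value)
) -> float:
    """Weighted sum L_All = gamma1 * L_Quality + gamma2 * L_Intelligibility.

    MSE is used for both task losses as an assumption, not as the
    paper's confirmed choice.
    """
    l_quality = mse(quality_pred, quality_true)
    l_intelligibility = mse(intell_pred, intell_true)
    return gamma1 * l_quality + gamma2 * l_intelligibility


if __name__ == "__main__":
    # Two utterances with quality scores, one with an intelligibility score.
    print(joint_loss([3.0, 4.0], [3.0, 5.0], [0.5], [1.0], gamma1=1.0, gamma2=2.0))
```

The weights $\gamma_{1}$ and $\gamma_{2}$ control how strongly each task drives the shared layers; setting one of them to zero would reduce the sketch to single-task training.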

Abstract

This research introduces MOSA-Net+, an enhanced version of the multi-objective speech assessment model (MOSA-Net) that leverages acoustic features from Whisper, a large-scale weakly supervised model. We first investigate the effectiveness of Whisper in building a more robust speech assessment model. We then explore combining representations from Whisper and SSL models. The experimental results reveal that Whisper's embedding features contribute to more accurate predictions, whereas combining the embedding features from Whisper and SSL models leads to only marginal improvement. Compared with intrusive methods, MOSA-Net, and other SSL-based speech assessment models, MOSA-Net+ yields notable improvements in estimating subjective quality and intelligibility scores across all evaluation metrics on the Taiwan Mandarin Hearing In Noise Test - Quality & Intelligibility (TMHINT-QI) dataset. To further validate its robustness, MOSA-Net+ was tested in the noisy-and-enhanced track of the VoiceMOS Challenge 2023, where it obtained the top-ranked performance among nine systems.

Paper Structure (12 sections, 5 equations, 3 figures, 4 tables)

Introduction
MOSA-Net+
Architecture
Whisper's and SSL's Embedding Analysis
Experiments
Experimental Setup
TMHINT-QI Experimental Results
Whisper for Speech Assessment Model
Comparison with other Methods
VoiceMOS Challenge 2023 Experimental Results
Comparison of Different Versions of Whisper
Conclusions

Figures (3)

Figure 1: Architecture of the MOSA-Net+ model.
Figure 2: Correlation analysis of the embedding features between Whisper and SSL models.
Figure 3: Performance comparison between MOSA-Net+ (Whisper Medium) and MOSA-Net+ (Whisper Large v3).
