A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models
Ryandhimas E. Zezario, Sabato M. Siniscalchi, Hsin-Min Wang, Yu Tsao
TL;DR
The paper studies zero-shot, non-intrusive speech assessment with large language models, comparing direct audio analysis by GPT-4o against GPT-Whisper, a pipeline that transcribes speech with Whisper ASR and then uses GPT-4o to rate the naturalness of the transcript. GPT-4o alone shows limited capability for accurate speech assessment, while GPT-Whisper achieves moderate correlation with human ratings of quality (SRCC ≈ 0.4360) and intelligibility (SRCC ≈ 0.5485), and a strong correlation with Whisper CER (SRCC ≈ 0.7784). Against SpeechLMScore and DNSMOS, GPT-Whisper is stronger on intelligibility but slightly weaker than SpeechLMScore on quality estimation; it also surpasses the supervised models MOS-SSL and MTI-Net in Spearman's rank correlation with Whisper CER. The findings support GPT-Whisper as a promising zero-shot metric for speech quality, intelligibility, and CER, highlighting potential for reducing labeled data needs and enabling prompt-driven evaluation. Future work will explore richer prompt engineering and tighter integration with speech-processing workflows.
Abstract
This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of the resulting text via targeted prompt engineering. We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlation with human-based quality and intelligibility assessments and with the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is less effective for audio analysis, while GPT-Whisper achieves higher prediction accuracy, with moderate correlation with speech quality and intelligibility and higher correlation with CER. Compared to SpeechLMScore and DNSMOS, GPT-Whisper excels in intelligibility metrics but performs slightly worse than SpeechLMScore in quality estimation. Furthermore, GPT-Whisper outperforms the supervised non-intrusive models MOS-SSL and MTI-Net in Spearman's rank correlation with Whisper's CER. These findings validate GPT-Whisper's potential for zero-shot speech assessment without requiring additional training data.
