Towards explainable reference-free speech intelligibility evaluation of people with pathological speech

Bence Mark Halpern, Thomas Tienkamp, Defne Abur, Tomoki Toda

TL;DR

This work introduces the ASR Inconsistency Score, a reference-free, explainable metric for estimating intelligibility in pathological speech. The score measures the mismatch between a perceptual transcription $W_{\text{greedy}}$ and a generated reference $W_{\text{ref}}$, where the reference is either $W_{\text{improved}}$, obtained with $N$-gram beam search, or $W_{\text{LLM}}$, obtained from LLM corrections. Computing $\mathrm{WER}(W_{\text{greedy}}, W_{\text{ref}})$ yields an intelligibility score that correlates highly with expert perceptual ratings across Dutch, Spanish, and English datasets, often approaching the reference-based WER upper bound. The study systematically compares 5-gram and large-language-model (GPT-3.5-turbo, GPT-4.1) references, finding that LLMs generally improve performance and that the visible differences between $W_{\text{greedy}}$ and $W_{\text{LLM}}$ make the score easier to explain. These results suggest a practically viable, interpretable alternative to reference-based metrics with potential for clinical adoption and cross-language applicability. They also highlight that the generated reference does not perfectly reconstruct the true reference, especially for severely impaired speech.
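
To make the scoring step concrete, here is a minimal Python sketch of $\mathrm{WER}(W_{\text{greedy}}, W_{\text{ref}})$ as a word-level edit distance. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the transcripts are assumed to be whitespace-tokenized and already normalized, and the edit distance is assumed to be divided by the length of $W_{\text{ref}}$. Producing $W_{\text{greedy}}$ and $W_{\text{ref}}$ (ASR decoding, beam search, or LLM correction) is outside the sketch.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def asr_inconsistency_score(w_greedy, w_ref):
    """Hypothetical helper: WER of w_greedy measured against w_ref.

    Assumption: the generated reference w_ref plays the role of the
    reference transcript, so the edit distance is divided by its length.
    """
    ref_words = w_ref.split()
    hyp_words = w_greedy.split()
    if not ref_words:
        raise ValueError("w_ref must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Invented example strings, for illustration only (not data from the paper).
example_greedy = "the cat sad on a mat"
example_ref = "the cat sat on the mat"
example_score = asr_inconsistency_score(example_greedy, example_ref)
```

In this invented example the two transcripts differ by two substitutions over six reference words, so the score is 2/6. The differing word pairs ("sad" vs. "sat", "a" vs. "the") are the kind of visible mismatch that the summary describes as the basis for explainability.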

Abstract

Objective assessment of speech that reflects meaningful changes in communication is crucial for clinical decision making and reproducible research. While existing objective assessments, particularly reference-based approaches, can capture intelligibility changes, they are often hindered by a lack of explainability and the need for labor-intensive manual transcriptions. To address these issues, this work proposes the reference-free, explainable ASR Inconsistency Score. We evaluate this method on pathological speech in Dutch, Spanish, and English, and compare its performance to a reference-based Word Error Rate (WER) baseline. Our results demonstrate that the ASR Inconsistency Score achieves a high correlation with expert perceptual ratings, with performance closely matching, and in one case exceeding, the reference-based WER baseline.

Paper Structure (16 sections, 4 equations, 3 tables)