Towards explainable reference-free speech intelligibility evaluation of people with pathological speech

Bence Mark Halpern; Thomas Tienkamp; Defne Abur; Tomoki Toda

TL;DR

This work introduces the ASR Inconsistency Score, a reference-free, explainable metric for estimating intelligibility in pathological speech. It measures the mismatch between a perceptual transcription $W_{\text{greedy}}$ and a generated reference $W_{\text{ref}}$, which is either $W_{\text{improved}}$ (obtained by $N$-gram beam-search decoding) or $W_{\text{LLM}}$ (obtained by LLM correction). By computing $\mathrm{WER}(W_{\text{greedy}}, W_{\text{ref}})$, the method yields an intelligibility score that correlates highly with expert perceptual ratings across Dutch, Spanish, and English datasets, often approaching the reference-based WER upper bound. The study systematically compares 5-gram and large-language-model (GPT-3.5-turbo, GPT-4.1) references, finding that LLMs generally improve performance and that explainability is enhanced through transparent differences between $W_{\text{greedy}}$ and $W_{\text{LLM}}$. These results suggest a practically viable, interpretable alternative to reference-based metrics with potential for clinical adoption and cross-language applicability, while also highlighting limitations in perfect reference reconstruction, especially for severely impaired speech.
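The score described above can be illustrated with a minimal Python sketch. It assumes the usual WER convention, word-level edit distance divided by the number of words in the reference, with the generated reference $W_{\text{ref}}$ in the reference role and $W_{\text{greedy}}$ as the hypothesis. The function names, the lowercasing, and the whitespace tokenization are illustrative assumptions, not the authors' implementation.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word tokens (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def asr_inconsistency_score(w_greedy: str, w_ref: str) -> float:
    """WER(W_greedy, W_ref), normalized by the generated reference length.

    Assumption: lowercasing and whitespace splitting stand in for whatever
    text normalization the actual pipeline applies.
    """
    ref_words = w_ref.lower().split()
    hyp_words = w_greedy.lower().split()
    if not ref_words:
        raise ValueError("generated reference is empty")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    # Hypothetical transcripts, for illustration only.
    w_greedy = "the cat sad on mat"
    w_ref = "the cat sat on the mat"
    print(asr_inconsistency_score(w_greedy, w_ref))
```

Because neither input is a manual transcription, the score is reference-free. The word-level differences between the two strings are also what make it inspectable.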

Abstract

Objective assessment of speech that reflects meaningful changes in communication is crucial for clinical decision making and reproducible research. While existing objective assessments, particularly reference-based approaches, can capture intelligibility changes, they are often hindered by a lack of explainability and the need for labor-intensive manual transcriptions. To address these issues, this work proposes the reference-free, explainable ASR Inconsistency Score. We evaluate this method on pathological speech in Dutch, Spanish and English, and compare its performance to a reference-based Word Error Rate (WER) baseline. Our results demonstrate that the ASR Inconsistency Score achieves a high correlation with expert perceptual ratings, with performance closely matching, and in one case exceeding, that of the reference-based WER baseline.

Paper Structure (16 sections, 4 equations, 3 tables)

Introduction
Proposed method
Acoustic-Driven Transcription ($W_{\text{greedy}}$)
Reference Generation Method 1: n-gram Beam-Search Decoding ($W_{\text{improved}}$)
Reference Generation Method 2: Large Language Model (LLM) Correction ($W_{\text{LLM}}$)
Final Score Calculation
Experimental settings
Evaluation datasets
Implementation details
Evaluation protocol
Results and Discussion
RQ1: Performance
RQ2: Language model choice
RQ3: Explainability
Conclusion
...and 1 more sections
