WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

Zachary Ellis, Jared Joselowitz, Yash Deo, Yajie He, Anna Kalygina, Aisling Higham, Mana Rahimzadeh, Yan Jia, Ibrahim Habli, Ernest Lim

TL;DR

This work argues that Word Error Rate is a poor proxy for safety in patient-facing clinical dialogue because it is insensitive to clinical context. It introduces a clinician-annotated benchmark for ASR clinical impact, a robust LLM-based semantic aligner for pairing ground-truth and ASR utterances, and a GEPA-optimized LLM judge (Gemini-2.5-Pro) that achieves human-comparable accuracy (90%, with Cohen's $\kappa = 0.816$) in risk assessment. Across two real-world datasets with diverse ASR providers, the study shows that existing metrics (WER, BLEURT, etc.) correlate poorly with clinical risk, underscoring the need for context-aware evaluation. The proposed framework enables scalable, risk-informed evaluation of clinical transcription safety, supporting safer deployment of automated clinical dialogue systems and governance-aligned prompt optimization.

Abstract

As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA through DSPy to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $\kappa$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
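To make the abstract's central claim concrete, the following minimal sketch computes word-level WER with a standard Levenshtein edit distance and applies it to two utterance pairs. The function name `word_error_rate` and both example utterances are illustrative assumptions, not taken from the paper's datasets or code. The sketch only shows that a dropped negation and a dropped filler word each cost one edit, so WER cannot tell them apart.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level WER: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words; we keep only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical utterance pairs (invented for illustration only).
dangerous = word_error_rate(
    "i do not have any chest pain",  # ground truth
    "i do have any chest pain",      # ASR drops the negation
)
benign = word_error_rate(
    "um i have been feeling fine",   # ground truth
    "i have been feeling fine",      # ASR drops a filler word
)
print(f"dropped negation: WER = {dangerous:.3f}")
print(f"dropped filler:   WER = {benign:.3f}")
```

In this toy case the clinically dangerous error scores a lower WER (1/7) than the harmless one (1/6), which is the kind of mismatch the clinician-annotated benchmark is designed to expose.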

Paper Structure

This paper contains 52 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of the clinical impact evaluation framework. Left: Two examples of ASR errors in patient utterances. Middle: We curate a dataset of clinical dialogues and transcriptions, and apply a novel semantically-aware sentence alignment pipeline to enable contextual clinical evaluation. Expert clinicians annotate a dataset of these errors based on our defined scale, labelling the minor change (Ex. 1) as "Insignificant" but the clinically dangerous negation (Ex. 2) as "Impactful". Right: Existing metrics like WER and other semantic scores correlate poorly with clinical risk. Our GEPA-optimized LLM-as-a-Judge closely matches clinical expert ratings.
  • Figure 2: Clinician annotation agreement and final label distribution. Left: IAA between two clinicians on the full labelled subset ($n=298$), with most disagreements between adjacent classes (0 vs. 1), yielding 79% agreement ($\kappa=0.54$). Right: Final adjudicated labels show a predominance of no-impact cases, with fewer minimal and significant-impact examples.
  • Figure 3: Performance of the LLM-based transcript aligner across Google (Dora) and Deepgram (Primock) ASR hypotheses. The figure shows high classification accuracy ($>98\%$) and low total error counts for both golden and ASR utterances.
  • Figure 4: Mean score difference per metric on the Metrics Subset, coloured by family; more negative bars indicate stronger alignment with clinical severity.
  • Figure 5: Per-class test set results for clinicians and the judge. 95% confidence intervals estimated via 1,000 bootstrap iterations.
  • ...and 3 more figures
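
The captions above report raw agreement alongside Cohen's $\kappa$ (Figure 2) and bootstrap confidence intervals (Figure 5). The sketch below shows one common way to compute both quantities for two label sequences on the 0/1/2 impact scale. The label lists are made-up toy data, the helper names `cohens_kappa` and `bootstrap_ci` are assumptions, and the percentile bootstrap used here is a generic choice that may differ from the paper's exact procedure.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


def bootstrap_ci(
    a: Sequence[int],
    b: Sequence[int],
    stat: Callable[[Sequence[int], Sequence[int]], float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for a paired statistic."""
    rng = random.Random(seed)
    n = len(a)
    samples = []
    for _ in range(n_boot):
        # Resample item indices with replacement, keeping the pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        samples.append(stat([a[i] for i in idx], [b[i] for i in idx]))
    samples.sort()
    lower = samples[int((alpha / 2) * n_boot)]
    upper = samples[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Toy labels (0 = no impact, 1 = minimal, 2 = significant); not real annotations.
rater_1 = [0, 0, 0, 0, 1, 1, 2, 2, 0, 1]
rater_2 = [0, 0, 1, 0, 1, 0, 2, 2, 0, 1]

kappa = cohens_kappa(rater_1, rater_2)
low, high = bootstrap_ci(rater_1, rater_2, cohens_kappa)
print(f"kappa = {kappa:.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

On this toy data the raters agree on 8 of 10 items (80%), yet $\kappa$ is about 0.68 because much of that agreement is expected by chance when one class dominates. The same effect explains how 79% raw agreement can correspond to $\kappa = 0.54$ in Figure 2.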