Clinical BERTScore: An Improved Measure of Automatic Speech Recognition Performance in Clinical Settings

Joel Shor, Ruyue Agnes Bi, Subhashini Venugopalan, Steven Ibara, Roman Goldenberg, Ehud Rivlin

TL;DR

The Clinical BERTScore (CBERTScore), an ASR metric that penalizes clinically relevant mistakes more than others, is presented and shown to align more closely with clinician preferences on medical sentences than other metrics (WER, BLEU, METEOR, etc.), sometimes by wide margins.

Abstract

Automatic Speech Recognition (ASR) in medical contexts has the potential to save time, cut costs, increase report accuracy, and reduce physician burnout. However, the healthcare industry has been slower to adopt this technology, in part due to the importance of avoiding medically relevant transcription mistakes. In this work, we present the Clinical BERTScore (CBERTScore), an ASR metric that penalizes clinically relevant mistakes more than others. We demonstrate that this metric aligns more closely with clinician preferences on medical sentences than other metrics (WER, BLEU, METEOR, etc.), sometimes by wide margins. We collect a benchmark of 18 clinician preferences on 149 realistic medical sentences, called the Clinician Transcript Preference benchmark (CTP), and make it publicly available for the community to further develop clinically aware ASR metrics. To our knowledge, this is the first public dataset of its kind. We demonstrate that CBERTScore more closely matches what clinicians prefer.

Paper Structure (20 sections, 3 equations, 3 figures)

This paper contains 20 sections, 3 equations, 3 figures.

Figures (3)

  • Figure 1: Left: Background of the clinicians who were surveyed to create the Clinician Transcript Preference (CTP) dataset. Right: Examples of medical sentence triplets, showing which transcript clinicians prefer and which transcript scores better under different metrics.
  • Figure 2: Comparison of different metrics' agreement with human rater transcript preferences. The process of deriving a prediction from metric values is described in Sec. \ref{subsubsec:ctp}. In all plots, "CBERTScore1.0" is the performance from only the medical term component ($k=1.0$ in Sec. \ref{sec:cbert_def}). "CBERTScore0.4" uses the optimal value of $k$ according to the train set. Left: Agreement with clinicians on the CTP benchmark when labels are derived using majority voting. Center: Agreement with clinicians on the CTP benchmark when restricted to questions with unanimous answers. Right: Agreement with speech pathologist raters on the non-medical dataset, when restricting the data to cases where there is a fidelity difference between two candidate transcripts.
  • Figure 3: Fraction of cases where metric Y is correct, conditioned on metrics X and Y disagreeing. This indicates how similar the pattern of mistakes is between metrics.
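
The quantities named in the figure captions can be illustrated with a minimal Python sketch. It is not the authors' implementation. The captions state only that $k=1.0$ keeps just the medical term component, so the convex combination in `combine_scores` is an assumption, as are the tie handling in `majority_label` and every function name here. Predictions and labels are taken to be strings such as "A" or "B" naming the preferred transcript.

```python
from collections import Counter
from typing import Optional, Sequence


def combine_scores(medical: float, general: float, k: float) -> float:
    """Blend a medical-term score with a whole-sentence score.

    ASSUMPTION: a convex combination with weight k. It is consistent with
    k = 1.0 using only the medical term component, but the exact form of
    CBERTScore is not given in the text above.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * medical + (1.0 - k) * general


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Return the most common vote, or None for no votes or a tie (assumed)."""
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def agreement(predictions: Sequence[str],
              labels: Sequence[Optional[str]]) -> float:
    """Fraction of labeled questions where the metric picks the label."""
    scored = [(p, y) for p, y in zip(predictions, labels) if y is not None]
    if not scored:
        return float("nan")
    return sum(p == y for p, y in scored) / len(scored)


def correct_given_disagreement(pred_x: Sequence[str],
                               pred_y: Sequence[str],
                               labels: Sequence[Optional[str]]) -> float:
    """Fraction of cases where metric Y is correct, among the labeled
    cases where metrics X and Y choose different transcripts."""
    cases = [(py, y) for px, py, y in zip(pred_x, pred_y, labels)
             if y is not None and px != py]
    if not cases:
        return float("nan")
    return sum(py == y for py, y in cases) / len(cases)


if __name__ == "__main__":
    # Made-up votes and predictions, for illustration only.
    labels = [majority_label(v) for v in (["A", "A", "B"], ["B", "B", "B"])]
    print(agreement(["A", "A"], labels))
    print(correct_given_disagreement(["A", "A"], ["A", "B"], labels))
```

A value of `correct_given_disagreement` near 0.5 would suggest that the two metrics err on different questions about equally often, while a value near 1.0 would suggest that metric Y is usually right whenever the two differ.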