Clinical BERTScore: An Improved Measure of Automatic Speech Recognition Performance in Clinical Settings

Joel Shor; Ruyue Agnes Bi; Subhashini Venugopalan; Steven Ibara; Roman Goldenberg; Ehud Rivlin

TL;DR

This paper presents the Clinical BERTScore (CBERTScore), an ASR metric that penalizes clinically relevant mistakes more than others, and demonstrates that it aligns more closely with clinician preferences on medical sentences than other metrics (WER, BLEU, METEOR, etc.), sometimes by wide margins.

Abstract

Automatic Speech Recognition (ASR) in medical contexts has the potential to save time, cut costs, increase report accuracy, and reduce physician burnout. However, the healthcare industry has been slower to adopt this technology, in part due to the importance of avoiding medically relevant transcription mistakes. In this work, we present the Clinical BERTScore (CBERTScore), an ASR metric that penalizes clinically relevant mistakes more than others. We demonstrate that this metric aligns more closely with clinician preferences on medical sentences than other metrics (WER, BLEU, METEOR, etc.), sometimes by wide margins. We collect a benchmark, the Clinician Transcript Preference benchmark (CTP), of 18 clinician preferences on 149 realistic medical sentences, and make it publicly available for the community to further develop clinically aware ASR metrics. To our knowledge, this is the first public dataset of its kind. We demonstrate that CBERTScore more closely matches what clinicians prefer.

Paper Structure (20 sections, 3 equations, 3 figures)

Introduction
Related work
Methods
Clinical BERTScore
Medical Entities
Tuning the medical entities weight factor
Clinician Transcript Preference (CTP) Dataset
Constructing the CTP triplets
Evaluating metrics on the CTP
Non-medical sentences
Results
Clinician responses
Metric agreement on medical text
Metric agreement on non-medical text
Discussion
...and 5 more sections

Figures (3)

Figure 1: Left: Background of the clinicians who were surveyed to create the Clinician Transcript Preference (CTP) dataset. Right: Some examples of triplet medical sentences, which transcript clinicians prefer, and which transcript scores better based on different metrics.
Figure 2: Comparison of different metrics' agreement with human rater transcript preferences. The process of deriving a prediction from metric values is described in Sec. \ref{subsubsec:ctp}. In all plots, "CBERTScore1.0" is the performance from only the medical term component ($k=1.0$ in Sec. \ref{sec:cbert_def}). "CBERTScore0.4" uses the optimal value of $k$ according to the train set. Left: Agreement with clinicians on the CTP benchmark when labels are derived using majority voting. Center: Agreement with clinicians on the CTP benchmark when restricted to questions with unanimous answers. Right: Agreement with speech pathologist raters on the non-medical dataset, when restricting the data to cases where there is a fidelity difference between two candidate transcripts.
Figure 3: Fraction of cases where metric Y is correct, conditioned on metrics X and Y disagreeing. An indicator of how similar the pattern of mistakes is between metrics.
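
The quantities named in the Figure 2 and Figure 3 captions can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the paper's implementation: the captions only state that $k=1.0$ keeps the medical term component alone, so the convex combination in `combined_score` is one plausible reading; the rule that a metric "prefers" the higher-scoring transcript stands in for the procedure the paper describes in its own section; and all function names and the toy votes are hypothetical.

```python
from collections import Counter


def combined_score(medical_score, full_score, k):
    """Assumed convex combination: k=1.0 keeps only the medical-term score."""
    return k * medical_score + (1.0 - k) * full_score


def metric_choice(score_a, score_b):
    """Hypothetical decision rule: prefer the higher-scoring transcript."""
    if score_a == score_b:
        return "tie"
    return "A" if score_a > score_b else "B"


def majority_label(votes):
    """Most common rater vote, or None when the top two counts are tied."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def agreement(predictions, labels):
    """Fraction of labeled items on which the metric matches the raters."""
    pairs = [(p, l) for p, l in zip(predictions, labels) if l is not None]
    if not pairs:
        return float("nan")
    return sum(p == l for p, l in pairs) / len(pairs)


def correct_given_disagreement(pred_x, pred_y, labels):
    """Fraction of items where Y matches the label, among items where X != Y."""
    rows = [
        (y, l)
        for x, y, l in zip(pred_x, pred_y, labels)
        if l is not None and x != y
    ]
    if not rows:
        return float("nan")
    return sum(y == l for y, l in rows) / len(rows)


# Toy data, made up for illustration only.
votes = [["A", "A", "B"], ["B", "B", "B"], ["A", "B", "B"], ["A", "A", "A"]]
labels = [majority_label(v) for v in votes]
pred_x = ["A", "A", "B", "B"]  # choices of a hypothetical metric X
pred_y = ["A", "B", "A", "A"]  # choices of a hypothetical metric Y

print("agreement X:", agreement(pred_x, labels))
print("agreement Y:", agreement(pred_y, labels))
print("Y correct | X != Y:", correct_given_disagreement(pred_x, pred_y, labels))
```

Restricting `labels` to items whose votes are unanimous before calling `agreement` would correspond to the center panel of Figure 2, again under the assumptions above.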
