
The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models

Ayo Adedeji, Sarita Joshi, Brendan Doohan

TL;DR

This work investigates post-processing ASR transcripts of medical dialogues with Large Language Models (LLMs), using the PriMock57 dataset to evaluate general Word Error Rate (WER), Medical Concept WER (MC-WER), and speaker diarization accuracy. It systematically compares zero-shot and Chain-of-Thought (CoT) prompting, supplemented by regex parsing and semantic similarity assessments via multiple embeddings, and uses Google's Healthcare Natural Language API for medical concept extraction. The results show that CoT prompting substantially improves diarization and MC-WER, with GPT-4 and Gemini-driven pairings achieving state-of-the-art performance in several configurations, while zero-shot prompting generally falls short of CoT. The study demonstrates that LLM-based post-processing can enhance transcription quality and semantic coherence in healthcare contexts without extensive model fine-tuning, with implications for evolving clinical documentation workflows and deployment in resource-limited environments. The findings also highlight the influence of punctuation quality and context window on performance, suggesting directions for robust, transparent interfaces that preserve original audio signals and confidence cues.

Abstract

In the rapidly evolving landscape of medical documentation, transcribing clinical dialogues accurately is increasingly paramount. This study explores the potential of Large Language Models (LLMs) to enhance the accuracy of Automatic Speech Recognition (ASR) systems in medical transcription. Utilizing the PriMock57 dataset, which encompasses a diverse range of primary care consultations, we apply advanced LLMs to refine ASR-generated transcripts. Our research is multifaceted, focusing on improvements in general Word Error Rate (WER), Medical Concept WER (MC-WER) for the accurate transcription of essential medical terms, and speaker diarization accuracy. Additionally, we assess the role of LLM post-processing in improving semantic textual similarity, thereby preserving the contextual integrity of clinical dialogues. Through a series of experiments, we compare the efficacy of zero-shot and Chain-of-Thought (CoT) prompting techniques in enhancing diarization and correction accuracy. Our findings demonstrate that LLMs, particularly through CoT prompting, not only improve the diarization accuracy of existing ASR systems but also achieve state-of-the-art performance in this domain. This improvement extends to more accurately capturing medical concepts and enhancing the overall semantic coherence of the transcribed dialogues. These findings illustrate the dual role of LLMs in augmenting ASR outputs and independently excelling in transcription tasks, holding significant promise for transforming medical ASR systems and leading to more accurate and reliable patient records in healthcare settings.
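
As a rough illustration of the metrics named in the abstract, the sketch below computes WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function names are hypothetical, and the MC-WER usage is an assumption: it applies the same computation to lists of medical concept mentions, which are hard-coded here rather than extracted with a concept-recognition service. The paper's exact normalization and concept-matching rules may differ.

```python
from typing import Sequence


def token_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Edit distance between two token sequences, divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i]  # deleting all i reference tokens matches an empty hypothesis
        for j, hyp_tok in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1       # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER over lowercased, whitespace-split words (a simplifying assumption)."""
    return token_error_rate(reference.lower().split(), hypothesis.lower().split())


# General WER: one deleted word out of five reference words.
wer = word_error_rate("the patient has a fever", "the patient has fever")

# Assumed MC-WER usage: the same rate over concept mentions only.
# These concept lists are made-up examples, not data from the paper.
ref_concepts = ["chest pain", "ibuprofen", "fever"]
hyp_concepts = ["chest pain", "fever"]
mc_wer = token_error_rate(ref_concepts, hyp_concepts)
```

Restricting the computation to concept mentions means an error on a clinically important term is not diluted by the many ordinary words that were transcribed correctly.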

Paper Structure

This paper contains 23 sections, 3 equations, 17 figures, and 14 tables.

Figures (17)

  • Figure 1: Excerpt from a mock consultation showing the ground truth transcript (top) and the corresponding ASR output (bottom). Errors are annotated: substitutions in red, deletions in blue, and insertions in green.
  • Figure 2: Visualization of medical concepts recognized by the Healthcare Natural Language API.
  • Figure 3: Workflow of Zero-Shot Prompting for ASR Diarization and Correction.
  • Figure 4: Workflow of Chain-of-Thought Prompting for ASR Punctuation, Diarization and Correction.
  • Figure 5: Illustration of MC-WER, comparing hypothesis (H) and reference (R) transcripts. Discrepancies are highlighted in red, demonstrating a deletion and a substitution error.
  • ...and 12 more figures