The Sound of Healthcare: Improving Medical Transcription ASR Accuracy with Large Language Models
Ayo Adedeji, Sarita Joshi, Brendan Doohan
TL;DR
This work investigates post-processing of ASR transcripts of medical dialogues with Large Language Models (LLMs), using the PriMock57 dataset to evaluate general Word Error Rate (WER), Medical Concept WER (MC-WER), and speaker diarization. It systematically compares zero-shot and Chain-of-Thought (CoT) prompting, supplemented by regex parsing and semantic similarity assessments over multiple embeddings, and uses Google's Healthcare NLP for medical concept extraction. The results show that CoT prompting substantially improves diarization and MC-WER, with GPT-4 and Gemini-driven pairings achieving state-of-the-art performance in several configurations, while zero-shot prompting generally falls short of CoT. The study demonstrates that LLM-based post-processing can improve transcription quality and semantic coherence in healthcare contexts without extensive model fine-tuning, which has implications for clinical documentation workflows and for deployment in resource-limited environments. The findings also highlight the influence of punctuation quality and context window size on performance, and point toward robust, transparent interfaces that preserve original audio signals and confidence cues.
Abstract
In the rapidly evolving landscape of medical documentation, accurate transcription of clinical dialogues is increasingly important. This study explores the potential of Large Language Models (LLMs) to enhance the accuracy of Automatic Speech Recognition (ASR) systems in medical transcription. Using the PriMock57 dataset, which covers a diverse range of primary care consultations, we apply advanced LLMs to refine ASR-generated transcripts. We measure improvements along three axes: general Word Error Rate (WER), Medical Concept WER (MC-WER) for the accurate transcription of essential medical terms, and speaker diarization accuracy. Additionally, we assess the role of LLM post-processing in improving semantic textual similarity, thereby preserving the contextual integrity of clinical dialogues. Through a series of experiments, we compare the efficacy of zero-shot and Chain-of-Thought (CoT) prompting techniques in enhancing diarization and correction accuracy. Our findings demonstrate that LLMs, particularly with CoT prompting, not only improve the diarization accuracy of existing ASR systems but also achieve state-of-the-art performance in this domain. This improvement extends to capturing medical concepts more accurately and enhancing the overall semantic coherence of the transcribed dialogues. These findings illustrate the dual role of LLMs in augmenting ASR outputs and independently excelling in transcription tasks, holding significant promise for transforming medical ASR systems and leading to more accurate and reliable patient records in healthcare settings.
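
To make the two error metrics named above concrete, the following is a minimal sketch of how WER and a concept-restricted variant could be computed. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names (tokenize, edit_distance, wer, concept_wer) are hypothetical, the text normalization is deliberately simple, and the MC-WER approximation (filtering both transcripts to a supplied list of concept terms before scoring) is a simplification that stands in for concept extraction with a medical NLP service.

```python
import re


def tokenize(text):
    # Lowercase and keep runs of letters, digits, and apostrophes.
    # Assumption: this simple normalization stands in for whatever
    # normalization a real evaluation pipeline applies.
    return re.findall(r"[a-z0-9']+", text.lower())


def edit_distance(ref, hyp):
    # Word-level Levenshtein distance (substitutions, insertions,
    # deletions), computed with a rolling dynamic-programming row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    # WER = (S + D + I) / N, where N is the number of reference words.
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def concept_wer(reference, hypothesis, concept_terms):
    # Hypothetical MC-WER approximation: keep only tokens that belong
    # to the supplied concept terms, then score the filtered sequences.
    vocab = {tok for term in concept_terms for tok in tokenize(term)}
    ref = [tok for tok in tokenize(reference) if tok in vocab]
    hyp = [tok for tok in tokenize(hypothesis) if tok in vocab]
    if not ref:
        raise ValueError("reference contains no concept terms")
    return edit_distance(ref, hyp) / len(ref)


reference = "The patient takes amoxicillin daily."
hypothesis = "the patient takes a moxie sillin daily"
print(wer(reference, hypothesis))                           # 0.6
print(concept_wer(reference, hypothesis, ["amoxicillin"]))  # 1.0
```

In this toy example a single misrecognized drug name costs three word-level edits out of five reference words, so general WER is 0.6, while the concept-restricted score is 1.0 because the only medical term was lost. That gap is the motivation for reporting a medical-concept metric alongside general WER.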
