High-precision medical speech recognition through synthetic data and semantic correction: UNITED-MEDASR

Sourav Banerjee, Ayushi Agarwal, Promila Ghosh

TL;DR

United-MedASR tackles domain-specific medical speech recognition by combining synthetic data generation from authoritative vocabularies (ICD-10, MIMS, FDA) with precision fine-tuning of Whisper, faster inference via Faster-Whisper, and a BART-based semantic enhancer. The approach reports low word error rates (WER) across multiple benchmarks, including a sub-1% WER on LibriSpeech and WERs below 0.35% on Europarl, TED-LIUM, and FLEURS, while maintaining medical terminology fidelity. The work presents a scalable, adaptable architecture that can be replicated across domains and suggests a practical path toward high-precision, privacy-conscious clinical ASR with strong potential for real-world deployment. The combination of synthetic data pipelines, efficient decoding, and semantic correction is a significant step toward reliable, domain-aware speech transcription in healthcare and beyond.

Abstract

Automatic Speech Recognition (ASR) systems in the clinical domain face significant challenges, notably the need to recognise specialised medical vocabulary accurately and meet stringent precision requirements. We introduce United-MedASR, a novel architecture that addresses these challenges by integrating synthetic data generation, precision ASR fine-tuning, and advanced semantic enhancement techniques. United-MedASR constructs a specialised medical vocabulary by synthesising data from authoritative sources such as ICD-10 (International Classification of Diseases, 10th Revision), MIMS (Monthly Index of Medical Specialties), and FDA databases. This enriched vocabulary is used to fine-tune the Whisper ASR model to better cater to clinical needs. To enhance processing speed, we incorporate Faster-Whisper for high-speed inference. Additionally, we employ a customised BART-based semantic enhancer to handle intricate medical terminology, further increasing accuracy. Our layered approach establishes new benchmarks in ASR performance, achieving a Word Error Rate (WER) of 0.985% on LibriSpeech test-clean, 0.26% on Europarl-ASR EN Guest-test, 0.29% on TED-LIUM, and 0.336% on FLEURS. Furthermore, we present an adaptable architecture that can be replicated across different domains, making it a versatile solution for domain-specific ASR systems.
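
For readers less familiar with the metric, the WER figures above are conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the system output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function name `word_error_rate`, the lowercasing and whitespace tokenisation, and the example sentences are all illustrative assumptions; the paper's own text normalisation may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is normalised only by lowercasing and splitting on
    whitespace; real evaluations usually apply a richer normaliser.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: a drug name split into two wrong words
    # counts as one substitution plus one insertion.
    reference = "patient was given metformin"
    hypothesis = "patient was given met forming"
    print(f"WER = {word_error_rate(reference, hypothesis):.2%}")
```

Under this definition, a WER of 0.985% corresponds to roughly one word error per hundred reference words. The example also shows why a single misrecognised medical term can be costly: one split drug name in a four-word utterance contributes two errors.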

Paper Structure

This paper contains 26 sections, 2 equations, 6 figures, 3 tables, 7 algorithms.

Figures (6)

  • Figure 1: End-to-End Workflow of United-MedASR ASR System Development
  • Figure 2: Synthetic Data Pipeline and United-MedASR Training Process
  • Figure 3: Synthetic Data Pipeline and United-MedASR Training Process
  • Figure 4: Performance Metrics for Fine-Tuning Whisper and BART-Base on Clinical Data.
  • Figure 5: United-MedASR Benchmarks Evaluation Flow.
  • ...and 1 more figures