Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy

Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang, Xian Yang, Quanlin Li, Pinghong Zhou, Shuo Wang

Abstract

Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
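The abstract's headline numbers rest on two standard metrics: character error rate (CER), the character-level edit distance divided by the reference length, and real-time factor (RTF), decoding time divided by audio duration. A minimal sketch of both (function names are ours, for illustration only; the paper's actual evaluation tooling is not specified here):

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance via dynamic programming (rolling row)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """CER = edit distance / reference length (0.0 = perfect transcript)."""
    return levenshtein(ref, hyp) / max(len(ref), 1)

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = decoding time / audio duration; values << 1 mean faster than real time."""
    return processing_seconds / audio_seconds
```

For intuition: an RTF of 0.005, as reported for EndoASR, means a 100-second utterance is transcribed in about half a second.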

Paper Structure

This paper contains 15 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Overall framework of EndoASR for real-world deployment in gastrointestinal endoscopy. The figure illustrates the conceptual and methodological design of EndoASR, spanning system motivation, model development, and clinical validation. Panel (a) presents the paradigm shift from algorithm-centric medical AI toward agent-based human–AI teaming, in which ASR serves as a real-time speech interface in hands-busy endoscopic workflows. Panel (b) summarizes the key challenges of speech recognition in endoscopy rooms, including specialized medical terminology, complex procedural acoustics, and real-time operational constraints. Panel (c) depicts the two-stage domain adaptation strategy, in which synthetic speech derived from structured endoscopy reports is used for domain-specific language adaptation, followed by noise-aware fine-tuning to improve robustness under realistic operating-room conditions. Panel (d) shows the progressive evaluation design, combining retrospective single-center validation across multiple endoscopists with prospective multi-center validation across diverse clinical content categories, enabling assessment of both controlled performance and real-world generalization.
  • Figure 2: Retrospective evaluation of ASR performance, inter-speaker variability, and noise-related robustness analysis. Panel (a) compares the performance of different ASR models on the retrospective clinical dataset across four metrics, including 1–CER, BLEU-1, BERTScore, and medical terminology accuracy. Panel (b) reports model performance stratified by individual endoscopists (P1–P6) on the same retrospective dataset, highlighting speaker-dependent variations across evaluation metrics. Panel (c) further visualizes inter-speaker variability by summarizing performance differences across endoscopists, illustrating the heterogeneity of intra-procedural speech characteristics. Panel (d) illustrates the relationship between model parameter size and medical terminology accuracy across different methods. Models closer to the upper-left corner achieve a more favorable efficiency–accuracy trade-off, delivering higher terminology accuracy with fewer parameters.
  • Figure 3: Prospective multi-center evaluation across clinical content categories. Panel (a) summarizes ASR performance on the prospective multi-center dataset across evaluation metrics, reflecting overall recognition quality under real-world clinical conditions. Panel (b) presents medical terminology accuracy stratified by center and content category, highlighting variation in domain-specific term recognition. Panel (c) compares 1–CER performance of different ASR models across centers and clinical categories, illustrating cross-institutional and content-dependent performance patterns in prospective evaluation.
  • Figure 4: Acoustic variability across centers and the impact of synthetic data scaling on medical term recognition. (a) t-SNE visualization of MFCC-based acoustic embeddings from the prospective multi-center dataset. Each point represents a speech segment, with colors indicating different clinical centers. Distinct clustering patterns are observed across centers, reflecting substantial inter-center variability in acoustic characteristics, likely due to differences in recording environments, speakers, and procedural conditions. (b) Effect of synthetic training data scale on medical term recognition accuracy. Performance is plotted against the size of synthetic data used for domain adaptation, where the point at size 0 corresponds to the baseline model without adaptation (Paraformer). Results show a consistent improvement in accuracy as the amount of synthetic data increases.
  • Figure 5: Data construction, transcription quality, runtime efficiency, and downstream usability of EndoASR. Panel (a) illustrates the construction of the training data, including synthetic speech generated from structured clinical text and noise-augmented speech informed by real endoscopy-room acoustics. Panel (b) compares representative transcription outputs produced by Whisper-large-v3, Paraformer, and EndoASR-noise, highlighting differences in domain-specific terminology recognition and transcription errors (shown in red). Panel (c) presents a comparison of real-time factor (RTF) across different ASR models, reflecting their runtime efficiency under practical deployment settings. Panel (d) demonstrates downstream usage by integrating ASR outputs with a large language model for structured clinical information extraction, comparing the performance of EndoASR and Whisper-large-v3 in supporting accurate and usable endoscopy report generation.
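The noise-augmented speech described in Figure 5(a) amounts to mixing clean synthetic utterances with recorded endoscopy-room noise at a controlled signal-to-noise ratio. A minimal SNR-controlled mixing sketch (NumPy only; the function name and interface are our illustration, not the paper's pipeline):

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)                 # average speech power
    p_noise = np.mean(noise ** 2)                   # average noise power
    # Solve p_speech / (scale^2 * p_noise) = 10^(snr_db / 10) for scale.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

In a training pipeline of this kind, the SNR would typically be sampled per utterance from a range reflecting measured operating-room conditions, so the model sees both mildly and heavily corrupted speech.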