MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

Khai Le-Duc, Phuc Phan, Tan-Hanh Pham, Bach Phan Tat, Minh-Huong Ngo, Chris Ngo, Thanh Nguyen-Tang, Truong-Son Hy

TL;DR

MultiMed addresses the paucity of publicly available multilingual medical ASR resources by introducing a real-world dataset across Vietnamese, English, German, French, and Mandarin Chinese and by releasing small-to-large end-to-end models. The study systematically compares Attention Encoder-Decoder (AED) and Hybrid approaches and analyzes monolingual versus multilingual fine-tuning under fixed parameter budgets, revealing that multilingual fine-tuning generally improves performance while Hybrid methods offer data efficiency. Key findings show that larger models and decoder-focused fine-tuning often yield the best results in several languages, though Chinese benefits from language-specific configurations, and multilingual training can degrade performance for some tone-sensitive languages. The work also provides an in-depth ablation of freezing schemes and a linguistically informed error analysis, with practical training schemes and public code/data to foster reproducibility and industry adoption in medical ASR.

Abstract

Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants. This technology improves patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we introduce MultiMed, the first multilingual medical ASR dataset, along with the first collection of small-to-large end-to-end medical ASR models, spanning five languages: Vietnamese, English, German, French, and Mandarin Chinese. To the best of our knowledge, MultiMed stands as the world's largest medical ASR dataset across all major benchmarks: total duration, number of recording conditions, number of accents, and number of speaking roles. Furthermore, we present the first multilinguality study for medical ASR, which includes reproducible empirical baselines, a monolinguality-multilinguality analysis, an Attention Encoder Decoder (AED) vs. Hybrid comparative study, and a linguistic analysis. We present practical end-to-end ASR training schemes optimized for a fixed number of trainable parameters, a constraint that is common in industry settings. All code, data, and models are available online: https://github.com/leduckhai/MultiMed/tree/master/MultiMed.

Paper Structure (46 sections, 35 equations, 1 figure, 11 tables)

Figures (1)

  • Figure 1: Performance comparison of Whisper models trained on German data prepared with two audio segmentation approaches: human-segmented short audio clips, and concatenated continuous audio segments of at most approximately 15 seconds. Performance is evaluated with both the Whisper Small and Whisper Medium model sizes. Training on concatenated audio yields a notable improvement, highlighting the efficacy of this data preparation technique for German transcription accuracy.
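
The concatenation step described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' preprocessing code: the `Segment` type, the `concatenate_segments` function, and the greedy merging rule are assumptions made for illustration. Only the roughly 15-second cap comes from the caption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """A human-segmented clip (hypothetical structure, for illustration only)."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcript of the clip


def concatenate_segments(segments: List[Segment], max_duration: float = 15.0) -> List[Segment]:
    """Greedily merge consecutive segments into chunks spanning at most max_duration.

    Assumes segments are sorted by start time and taken from one recording.
    A single segment that is already longer than max_duration is kept as is.
    """
    merged: List[Segment] = []
    current = None
    for seg in segments:
        if current is None:
            current = Segment(seg.start, seg.end, seg.text)
        elif seg.end - current.start <= max_duration:
            # Extending the chunk keeps its total span within the cap.
            current.end = seg.end
            current.text = f"{current.text} {seg.text}"
        else:
            # Adding this segment would exceed the cap: close the chunk, start a new one.
            merged.append(current)
            current = Segment(seg.start, seg.end, seg.text)
    if current is not None:
        merged.append(current)
    return merged


clips = [
    Segment(0.0, 4.0, "a"),
    Segment(4.0, 9.0, "b"),
    Segment(9.0, 16.0, "c"),
    Segment(16.0, 20.0, "d"),
]
chunks = concatenate_segments(clips)
```

With these four example clips, the first two merge into one 9-second chunk and the last two into one 11-second chunk. The span is measured from the start of the first clip to the end of the last, so silence between clips counts toward the cap; a pipeline that drops the gaps would sum clip durations instead.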