MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation

Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang

TL;DR

MultiMed-ST delivers a large-scale, five-language medical speech translation dataset (290,000 samples) spanning Vietnamese, English, German, French, and Chinese in all directions, enabling a systematic study of medical ST. The work analyzes data collection and annotation, problem formulation (end-to-end vs. cascaded), and models across bilingual and multilingual fine-tuning and pre-training settings, using both automatic and human/LLM-based evaluations. Key findings: cascaded ST generally outperforms end-to-end ST; bilingual fine-tuning offers advantages on ground-truth data, while multilingual pre-training can match bilingual performance in cascaded setups; code-switch handling is feasible with multilingual models; and automatic metrics correlate well with human judgments in this domain. The dataset and open-source code support robust evaluation and reproducibility, advancing practical cross-lingual communication in healthcare and setting a new benchmark for medical cross-lingual speech translation.

Abstract

Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhance patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we first present what is, to the best of our knowledge, the first systematic study on medical ST by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST dataset across all domains. Second, we present what is, to the best of our knowledge, the most comprehensive ST analysis in the field's history, including: empirical baselines, a bilingual-multilingual comparative study, an end-to-end vs. cascaded comparative study, a task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST
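
As a rough illustration of the scope described in the abstract, the sketch below enumerates the translation directions implied by five languages and contrasts the two problem formulations the study compares. The language codes and the `asr`, `mt`, and `st` callables are hypothetical placeholders rather than the released models' API, and treating Simplified and Traditional Chinese as a single language is an assumption made here for counting.

```python
from itertools import permutations
from typing import Callable

# Hypothetical language codes for the five languages named in the abstract.
LANGUAGES = ["vi", "en", "de", "fr", "zh"]

# Every ordered (source, target) pair with source != target.
DIRECTIONS = list(permutations(LANGUAGES, 2))


def cascaded_st(
    audio: bytes,
    asr: Callable[[bytes], str],
    mt: Callable[[str], str],
) -> str:
    """Cascaded ST: transcribe the source audio, then translate the transcript."""
    transcript = asr(audio)
    return mt(transcript)


def end_to_end_st(audio: bytes, st: Callable[[bytes], str]) -> str:
    """End-to-end ST: a single model maps source audio directly to target text."""
    return st(audio)
```

Under these assumptions, five languages give 5 × 4 = 20 ordered directions. The cascaded variant exposes an intermediate transcript, which is also where recognition errors can propagate into the translation step.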

Paper Structure

This paper contains 86 sections, 37 equations, 52 figures, 13 tables.

Figures (52)

  • Figure 1: An overview of MultiMed-ST -- A large-scale, many-to-many multilingual medical speech translation framework and dataset for facilitating cross-lingual communication in healthcare settings.
  • Figure 2: visualization. The computation of begins by dividing the original waveform into overlapping 20ms frames.
  • Figure 3: OpenAI's Whisper architecture. Whisper is a Transformer-based architecture, using features as input.
  • Figure 4: Deepgram's Nova-2 architecture. To our best understanding of Deepgram's documentation, Deepgram's Nova-2 is a Transformer-based architecture, using raw waveform as input instead of like Whisper. Feature extraction from raw waveform is probably conducted by a learnable feature encoder, e.g. a block of like wav2vec 2.0. Between encoder-decoder space, (unknown) acoustic embeddings are probably added as cross-attention.
  • Figure 5: SpecAugment visualization. From top to bottom, the figures show the spectrogram of the input audio with no data augmentation, time masking, frequency masking and both masking applied.
  • ...and 47 more figures
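
The time and frequency masking named in the Figure 5 caption can be sketched as follows. This is a minimal pure-Python illustration and not the paper's implementation: it applies a single mask per axis, omits the time-warping step of the original SpecAugment method, and the function names, the `max_width` parameter, and the zero fill value are assumptions chosen for the example.

```python
import random
from typing import List

# A spectrogram indexed as spec[frequency_bin][time_frame].
Spectrogram = List[List[float]]


def mask_time(spec: Spectrogram, max_width: int, rng: random.Random,
              fill: float = 0.0) -> Spectrogram:
    """Return a copy with one random contiguous block of time frames masked."""
    n_frames = len(spec[0])
    width = rng.randint(0, min(max_width, n_frames))
    start = rng.randint(0, n_frames - width)
    return [
        [fill if start <= t < start + width else value
         for t, value in enumerate(row)]
        for row in spec
    ]


def mask_frequency(spec: Spectrogram, max_width: int, rng: random.Random,
                   fill: float = 0.0) -> Spectrogram:
    """Return a copy with one random contiguous block of frequency bins masked."""
    n_bins = len(spec)
    width = rng.randint(0, min(max_width, n_bins))
    start = rng.randint(0, n_bins - width)
    return [
        [fill] * len(row) if start <= f < start + width else list(row)
        for f, row in enumerate(spec)
    ]


# Usage: apply both masks, as in the last panel described in the caption.
rng = random.Random(0)
clean = [[1.0] * 10 for _ in range(8)]  # 8 frequency bins, 10 time frames
augmented = mask_frequency(mask_time(clean, max_width=3, rng=rng),
                           max_width=2, rng=rng)
```

Both functions return new lists, so the clean spectrogram stays available for comparison with the masked versions.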