Towards Building Multilingual Language Model for Medicine

Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang, Weidi Xie

TL;DR

This work tackles the English-centric bias of medical language models by building MMedC, a large-scale multilingual medical corpus, and MMedBench, a multilingual QA benchmark with rationales. It demonstrates that auto-regressive training on MMedC improves open-source multilingual medical LLMs, with MMed-Llama 3 (8B) delivering state-of-the-art performance among open-source models across languages and competitive results on English benchmarks. The study also provides evaluation methodologies, human-rating analyses, and released resources (data, code, models) to advance multilingual medical NLP research. Overall, the paper presents a cohesive pipeline, from data curation to benchmarking and modeling, that expands access to medical AI across linguistic boundaries and sets a foundation for future retrieval-augmented, multilingual medical systems.

Abstract

The development of open-source, multilingual medical language models can benefit a wide, linguistically diverse audience from different regions. To promote this domain, we make the following contributions. First, we construct a multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across 6 main languages, enabling auto-regressive domain adaptation for general LLMs. Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multi-choice question-answering benchmark with rationale, termed MMedBench. Third, we assess a number of open-source large language models (LLMs) on our benchmark, along with those further trained auto-regressively on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, we present a large-scale corpus, a benchmark, and a series of models to support the development of multilingual medical LLMs.

Paper Structure (32 sections, 5 equations, 11 figures, 17 tables, 1 algorithm)

Figures (11)

  • Figure 1: Overview of our contributions. a Our proposed large-scale multilingual medical corpus (MMedC), containing 25.5B tokens, covering six main languages, and collected from four data sources. b The composition of our comprehensive multilingual medical benchmark (MMedBench), constructed by aggregating medical QA cases in different languages and prompting GPT-4 to provide rationale sentences. MMedBench enables evaluation of both multi-choice accuracy and rationale generation ability for different LLMs under zero-shot or fine-tuning settings. c The line plot shows the final multi-choice accuracy of various LLMs on MMedBench, where our final model, MMed-Llama 3, demonstrates the best performance among all existing open-source LLMs. d The bar chart further details the gains in both multi-choice accuracy and rationale generation ability when comparing MMedLM 2 to InternLM 2, and MMed-Llama 3 to Llama 3. Since the main difference between our models and their base models is the auto-regressive training on MMedC, this comparison highlights the importance of our medical-specific multilingual corpus.
  • Figure 1: Case 1. An English case in MMedBench
  • Figure 2: Statistical results on MMedC. a The worldwide distribution of the languages included in MMedC. The map shows that our collected corpus covers most major countries. b The token distribution for each language. The bar plot shows the detailed token count for each language. c The contributions of the four data sources (filtered content, medical textbooks, medical websites, and small-scale corpora) to the six languages in MMedC, shown as a Sankey diagram.
  • Figure 2: Case 2. A Chinese case in MMedBench
  • Figure 3: Statistical results on MMedBench. a The bar plot shows basic statistics for the train and test sets of MMedBench. "Avg. tokens" is the mean token length per sample for each component: "Rationale" denotes the rationale sentences in the answer, "Option" denotes the option descriptions in the choice list, and "Question" denotes the question sentences. "Prop. of multi-option" denotes the proportion of questions with multiple correct options, and "Prop. of single-option" denotes the proportion of questions with exactly one correct option. "Number of QA pairs" denotes how many QA pairs are in the train or test split. b The histogram shows the topic distribution in the test split of MMedBench, covering a wide range of medical areas, from general and specialized medicine to basic medical sciences. This allows MMedBench to comprehensively measure the performance of medical models.
  • ...and 6 more figures
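
The split-level statistics named in the Figure 3 caption ("Prop. of multi-option", "Number of QA pairs") and the multi-choice accuracy reported in Figure 1c can be illustrated with a minimal sketch. The record layout (`QAItem`), the function names, and the exact-match scoring rule for multi-option questions are assumptions made for illustration; the actual MMedBench schema and scoring protocol may differ.

```python
from dataclasses import dataclass


@dataclass
class QAItem:
    """Hypothetical record layout, not the released MMedBench schema."""
    question: str
    options: dict[str, str]   # option letter -> option text
    answer: frozenset[str]    # set of correct option letters
    rationale: str = ""


def multi_option_proportion(items: list[QAItem]) -> float:
    """Fraction of questions that have more than one correct option."""
    if not items:
        return 0.0
    return sum(len(item.answer) > 1 for item in items) / len(items)


def exact_match_accuracy(items: list[QAItem], predictions: list[str]) -> float:
    """Share of questions whose predicted option letters equal the gold set.

    Assumption: a multi-option question counts as correct only when every
    correct letter, and no incorrect letter, is predicted.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        frozenset(pred) == item.answer
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


# Toy data, invented purely to exercise the functions above.
toy_options = {"A": "option a", "B": "option b", "C": "option c"}
toy_items = [
    QAItem("q1", toy_options, frozenset({"A"})),
    QAItem("q2", toy_options, frozenset({"B"})),
    QAItem("q3", toy_options, frozenset({"A", "C"})),
    QAItem("q4", toy_options, frozenset({"C"})),
]
toy_predictions = ["A", "C", "AC", "C"]

print(len(toy_items))                                    # number of QA pairs
print(multi_option_proportion(toy_items))                # proportion of multi-option
print(exact_match_accuracy(toy_items, toy_predictions))  # multi-choice accuracy
```

Under the exact-match assumption, partial credit is never awarded, so the prediction "A" for a question whose gold answer is {A, C} would be scored as wrong.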