Towards Building Multilingual Language Model for Medicine
Pengcheng Qiu, Chaoyi Wu, Xiaoman Zhang, Weixiong Lin, Haicheng Wang, Ya Zhang, Yanfeng Wang, Weidi Xie
TL;DR
This work tackles the English-centric bias of medical language models by building MMedC, a large-scale multilingual medical corpus, and MMedBench, a multilingual QA benchmark with rationales. It shows that auto-regressive training on MMedC improves open-source multilingual medical LLMs, with MMed-Llama 3 (8B) outperforming other open-source models across languages and remaining competitive on English benchmarks. The study also provides extensive evaluation methodologies, human-rating analyses, and release-ready resources (data, code, models) to advance multilingual medical NLP research. Overall, the paper presents a cohesive pipeline, from data curation to benchmarking and modeling, that expands access to medical AI across linguistic boundaries and lays a foundation for future retrieval-augmented, multilingual medical systems.
Abstract
The development of open-source, multilingual medical language models can benefit a wide, linguistically diverse audience across regions. To advance this area, we make three contributions. First, we construct a multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across 6 main languages, enabling auto-regressive domain adaptation of general LLMs. Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multiple-choice question-answering benchmark with rationales, termed MMedBench. Third, we assess a number of open-source large language models (LLMs) on our benchmark, along with those further trained auto-regressively on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, we present a large-scale corpus, a benchmark, and a series of models to support the development of multilingual medical LLMs.
