Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks

Chaimae Abouzahir, Congbo Ma, Nizar Habash, Farah E. Shamout

TL;DR

The paper examines Arabic medical question answering (QA) with modern LLMs, where performance lags behind English despite similar architectures. It introduces a cross-lingual diagnostic framework and pairs MedAraBench with its English translations to separate language effects from medical content, enabling controlled comparisons. The key finding is that Arabic degradation arises from interacting representation, alignment, and evaluation factors, and that it worsens with greater task complexity and longer inputs; tokenization fragmentation and miscalibrated confidence and explanations contribute further. The work proposes a language-aware approach to evaluation and design, and offers a transferable framework for diagnosing multilingual medical AI systems in diverse clinical contexts.

Abstract

In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
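
The reliability claim above, that model-reported confidence correlates only weakly with correctness, can be illustrated with a small sketch. This is not the authors' analysis code: the function name `confidence_correctness_correlation`, the choice of a Pearson (point-biserial) correlation, and the toy inputs are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def confidence_correctness_correlation(confidences, correct):
    """Pearson correlation between confidence scores and 0/1 correctness.

    Hypothetical helper: `confidences` are model-reported scores (any
    numeric scale) and `correct` are booleans, one per question.
    Returns NaN when either input has zero variance.
    """
    if len(confidences) != len(correct) or len(confidences) < 2:
        raise ValueError("need two equal-length inputs with at least 2 items")

    outcomes = [1.0 if ok else 0.0 for ok in correct]
    mean_conf, mean_out = mean(confidences), mean(outcomes)
    sd_conf, sd_out = pstdev(confidences), pstdev(outcomes)

    # Correlation is undefined if confidence or correctness never varies.
    if sd_conf == 0 or sd_out == 0:
        return float("nan")

    covariance = mean(
        (c - mean_conf) * (o - mean_out)
        for c, o in zip(confidences, outcomes)
    )
    return covariance / (sd_conf * sd_out)


# Toy data (made up): confident answers are right, unsure ones are wrong.
r = confidence_correctness_correlation(
    [0.9, 0.8, 0.2, 0.1], [True, True, False, False]
)
print(round(r, 2))  # 0.99
```

A value near zero on real model outputs would correspond to the "limited correlation" described in the abstract; computing it separately for Arabic and English questions would allow a per-language comparison.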

Paper Structure (30 sections, 11 figures, 3 tables)

This paper contains 30 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Effect of question length on accuracy across Arabic and English. (a–b) Rolling accuracy versus question length for DeepSeek-V3.2 and Med42-70B, respectively. (c) Distribution of question lengths in both languages. (d) Arabic–English length correspondence for aligned question pairs.
  • Figure 2: Accuracy by educational difficulty level (early vs. later years) for DeepSeek-V3.2 and Med42-70B on Arabic and English medical MCQs.
  • Figure 3: Accuracy by medical specialty for DeepSeek-V3.2 (top) and Med42-70B (bottom) on Arabic and English medical MCQs.
  • Figure 4: Token-level sequence-match accuracy (%) for text-matching evaluation in Arabic and English across two models.
  • Figure 5: Relationship between model-reported confidence and accuracy on Arabic (top) and English (bottom) medical MCQs.
  • ...and 6 more figures
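
The "rolling accuracy versus question length" curves named in the Figure 1 caption can be approximated as follows. This is a minimal sketch under stated assumptions: the helper `rolling_accuracy`, the window size, and the unit of length (words, characters, or tokens) are hypothetical, since the caption does not specify how the authors computed the curves.

```python
def rolling_accuracy(lengths, correct, window=3):
    """Sliding-window accuracy over questions sorted by length.

    Hypothetical helper: returns a list of (mean_length, accuracy)
    pairs, one per window position, suitable for a line plot.
    """
    if len(lengths) != len(correct):
        raise ValueError("lengths and correct must have the same size")
    if window < 1 or window > len(lengths):
        raise ValueError("window must be between 1 and the number of items")

    # Sort question indices from shortest to longest question.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    sorted_len = [lengths[i] for i in order]
    sorted_ok = [1 if correct[i] else 0 for i in order]

    points = []
    for start in range(len(order) - window + 1):
        win_len = sorted_len[start:start + window]
        win_ok = sorted_ok[start:start + window]
        points.append((sum(win_len) / window, sum(win_ok) / window))
    return points


# Toy data (made up): shorter questions answered correctly, longer ones not.
print(rolling_accuracy([10, 30, 20, 40], [True, False, True, False], window=2))
# [(15.0, 1.0), (25.0, 0.5), (35.0, 0.0)]
```

Running the same function on the Arabic and English versions of each question would give one curve per language, which is the kind of comparison panels (a) and (b) describe.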