BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities

Sahal Shaji Mullappilly, Mohammed Irfan Kurpath, Sara Pieri, Saeed Yahya Alseiari, Shanavas Cholakkal, Khaled Aldahmani, Fahad Khan, Rao Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal

TL;DR

BiMediX2 introduces a bilingual Arabic–English medical LMM with multimodal capabilities, unifying text and image reasoning to support diverse clinical tasks. The approach uses a two-stage training pipeline combining Vision-Text alignment via a Projector and LoRA-based multimodal instruction tuning on a large bilingual corpus (BiMed-V), anchored by the Arabic–English BiMed-MBench benchmark. The model achieves state-of-the-art performance across 12 medical benchmarks, excelling in VQA, report generation, and summarization, and significantly outperforms non-bilingual baselines and GPT-4o on several metrics. This work advances accessible, multilingual medical AI and provides datasets, benchmarks, and code to facilitate further research while acknowledging safety, ethical, and deployment considerations for clinical use.

Abstract

We introduce BiMediX2, a bilingual (Arabic-English) Bio-Medical EXpert Large Multimodal Model that supports text-based and image-based medical interactions. It enables multi-turn conversation in Arabic and English and supports diverse medical imaging modalities, including radiology, CT, and histology. To train BiMediX2, we curate BiMed-V, an extensive Arabic-English bilingual healthcare dataset consisting of 1.6M samples of diverse medical interactions. This dataset supports a range of medical Large Language Model (LLM) and Large Multimodal Model (LMM) tasks, including multi-turn medical conversations, report generation, and visual question answering (VQA). We also introduce BiMed-MBench, the first Arabic-English medical LMM evaluation benchmark, verified by medical experts. BiMediX2 demonstrates excellent performance across multiple medical LLM and LMM benchmarks, achieving state-of-the-art results compared to other open-sourced models. On BiMed-MBench, BiMediX2 outperforms existing methods by over 9% in English and more than 20% in Arabic evaluations. Additionally, it surpasses GPT-4 by approximately 9% in UPHILL factual accuracy evaluations and excels in various medical VQA, report generation, and report summarization tasks. Our trained models, instruction set, and source code are available at https://github.com/mbzuai-oryx/BiMediX2

Paper Structure

This paper contains 27 sections, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Performance comparison on BiMed-MBench. The comparison is conducted across different tasks and modalities, including CT, MRI, CXR, Histology, and Gross, along with their Arabic counterparts (CT_ar, MRI_ar, CXR_ar, Histology_ar, and Gross_ar). Each axis represents the performance score for a specific category, highlighting BiMediX2’s superior performance across diverse tasks and modalities in both English and Arabic.
  • Figure 2: BiMediX2: Overall Architecture. Our model is designed for medical image analysis and bilingual multi-turn conversations. Medical images are processed through a Vision Encoder and aligned with a Projector, while the text inputs are tokenized using the default tokenizer. The resulting tokens are then passed into the language model (Meta Llama 3.1) to generate responses in the prompted language. We only train the language model using LoRA adapters, while the projector is finetuned for medical image-text alignment. BiMediX2 follows a two-stage training pipeline. Stage-1 aligns medical visual concepts using 467K image-caption pairs. Stage-2 performs multimodal medical instruction tuning with our proposed BiMed-V 1.6M bilingual instructions comprising both image-text and text-only medical instructions.
  • Figure 3: Qualitative Examples of BiMediX2 for Medical Image Understanding in a Conversational Context.
  • Figure 4: Performance comparison on UPHILL OpenQA (Kaur et al., 2023), assessing the model's ability to address false medical claims at different presupposition levels.
  • Figure 5: Data Translation Framework
  • ...and 6 more figures
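
The two-stage pipeline described in the Figure 2 caption can be summarized as a small configuration sketch. This is an illustration only, not the authors' training code: the component names, the `Stage` dataclass, and the helper functions are hypothetical labels. The sketch assumes that the vision encoder and base language-model weights stay frozen throughout, and that the projector remains trainable in Stage-2 alongside the LoRA adapters; the caption does not state either point explicitly.

```python
from dataclasses import dataclass

# Illustrative component labels; these are NOT identifiers from the BiMediX2 code base.
COMPONENTS = (
    "vision_encoder",
    "projector",
    "llm_base_weights",
    "llm_lora_adapters",
)


@dataclass(frozen=True)
class Stage:
    name: str
    data: str
    approx_samples: int
    trainable: frozenset


STAGES = (
    # Stage-1: align medical visual concepts with text via the projector.
    Stage(
        name="stage1_alignment",
        data="medical image-caption pairs",
        approx_samples=467_000,
        trainable=frozenset({"projector"}),
    ),
    # Stage-2: multimodal instruction tuning with LoRA adapters on the LLM.
    # ASSUMPTION: the projector is kept trainable here; the source does not say.
    Stage(
        name="stage2_instruction_tuning",
        data="BiMed-V bilingual instructions (image-text and text-only)",
        approx_samples=1_600_000,
        trainable=frozenset({"projector", "llm_lora_adapters"}),
    ),
)


def frozen_components(stage: Stage) -> tuple:
    """Return the components that receive no gradient updates in this stage."""
    return tuple(c for c in COMPONENTS if c not in stage.trainable)


def describe(stage: Stage) -> str:
    """Build a one-line summary of what a stage trains and what it freezes."""
    train = ", ".join(sorted(stage.trainable))
    freeze = ", ".join(frozen_components(stage))
    return f"{stage.name}: train [{train}]; freeze [{freeze}]"


if __name__ == "__main__":
    for stage in STAGES:
        print(describe(stage))
```

Keeping the base language-model weights out of every `trainable` set mirrors the caption's statement that the language model is trained only through LoRA adapters.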