MediX-R1: Open Ended Medical Reinforcement Learning

Sahal Shaji Mullappilly; Mohammed Irfan Kurpath; Omair Mohamed; Mohamed Zidan; Fahad Khan; Salman Khan; Rao Anwer; Hisham Cholakkal

TL;DR

The results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models.

Abstract

We introduce MediX-R1, an open-ended Reinforcement Learning (RL) framework for medical multimodal large language models (MLLMs) that enables clinically grounded, free-form answers beyond multiple-choice formats. MediX-R1 fine-tunes a baseline vision-language backbone with Group Based RL and a composite reward tailored for medical reasoning: an LLM-based accuracy reward that judges semantic correctness with a strict YES/NO decision, a medical embedding-based semantic reward to capture paraphrases and terminology variants, and lightweight format and modality rewards that enforce interpretable reasoning and modality recognition. This multi-signal design provides stable, informative feedback for open-ended outputs where traditional verifiable or MCQ-only rewards fall short. To measure progress, we propose a unified evaluation framework for both text-only and image+text tasks that uses a Reference-based LLM-as-judge in place of brittle string-overlap metrics, capturing semantic correctness, reasoning, and contextual alignment. Despite using only $\sim51$K instruction examples, MediX-R1 achieves excellent results across standard medical LLM (text-only) and VLM (image + text) benchmarks, outperforming strong open-source baselines and delivering particularly large gains on open-ended clinical tasks. Our results demonstrate that open-ended RL with comprehensive reward signals and LLM-based evaluation is a practical path toward reliable medical reasoning in multimodal models. Our trained models, curated datasets and source code are available at https://medix.cvmbzuai.com
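The composite reward described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the weights in `RewardWeights`, the weighted-sum combination, the regular expression for the `<think>`/`<answer>` format check, and the precomputed inputs (`judge_verdict`, `embedding_similarity`, `modality_correct`) are all assumptions introduced here. The `group_advantages` helper shows one common group-based scheme (normalizing rewards within a group of sampled responses); the source only says "Group Based RL", so that choice is also an assumption.

```python
import re
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; the source does not state how signals are combined."""
    llm: float = 1.0
    embedding: float = 0.5
    fmt: float = 0.25
    modality: float = 0.25


# Assumed format: one <think> block followed by one <answer> block.
_FORMAT_RE = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed <think>/<answer> layout."""
    return 1.0 if _FORMAT_RE.match(response) else 0.0


def llm_reward(judge_verdict: str) -> float:
    """Map the judge's strict YES/NO decision to 1.0/0.0."""
    return 1.0 if judge_verdict.strip().upper() == "YES" else 0.0


def composite_reward(
    response: str,
    judge_verdict: str,
    embedding_similarity: float,
    modality_correct: bool,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the four signals (the combination rule is an assumption)."""
    similarity = min(max(embedding_similarity, 0.0), 1.0)  # clamp to [0, 1]
    return (
        weights.llm * llm_reward(judge_verdict)
        + weights.embedding * similarity
        + weights.fmt * format_reward(response)
        + weights.modality * (1.0 if modality_correct else 0.0)
    )


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Assumed scheme: subtract the group mean, divide by the group std.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In this sketch the judge verdict and the embedding similarity are computed elsewhere and passed in, which keeps the combination logic testable without calling a model.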

Paper Structure (28 sections, 8 equations, 7 figures, 8 tables)

Introduction
Open Ended Medical RL
Group-based RL with Composite Rewards
Reward Design
Evaluation Framework
Experiments and Results
State-of-the-art Comparisons
Ablation Experiments
Reward Hacking and Mitigation
Human Expert Evaluation
Evaluation on Real World Clinical Data
Qualitative Examples
Conclusion
Appendix
Training Data and Modality Distribution
...and 13 more sections

Figures (7)

Figure 1: Average accuracy across multimodal medical benchmarks vs. training dataset size for recent medical VLMs. Colors denote model families; marker shape and size indicate approximate parameter scale ($\sim$2B, 8B, 30B). × denotes open-source availability of training data (*as of 25/02/2026). MediX-R1 8B (68.8%) surpasses MedGemma 27B (68.4%) while using significantly less training data, and MediX-R1 30B achieves the highest overall accuracy (73.6%). All training and evaluation resources are available at https://medix.cvmbzuai.com.
Figure 2: MediX-R1: Overall Architecture. The MediX-R1 reinforcement learning framework for open-ended medical reasoning. An input of a medical image and a natural language question is processed by MediX-R1. The model's policy is trained using Group Based RL, which leverages a multi-faceted reward signal. This reward is composed of: a) an LLM-based reward for evaluating the overall quality and correctness of the output; b) an embedding-based reward to ensure semantic alignment; c) a format reward to enforce the desired output structure (<think> and <answer> blocks); and d) a modality reward to ensure the response is grounded in the specified imaging modality. This reward-guided approach encourages the model to generate accurate and interpretable reasoning paths.
Figure 3: Evaluation Framework. Our three-stage evaluation pipeline: (1) Generation via vLLM inference on the model under test, (2) Evaluation using Reference-based LLM-as-judge with BASE and MIMIC templates, and (3) Scoring through aggregation of judgment outputs. The framework supports both binary decisions for QA/MCQ tasks and rubric-based scoring for long-form reports, ensuring robust evaluation across diverse medical benchmarks.
Figure 4: Qualitative examples of MediX-R1. (Top, Microscopy) Correctly identifies the optic tract in section G with interpretable reasoning. (Bottom, X-ray) Explains why heart size appears smaller in PA vs. AP view. MediX-R1 generates clinically grounded, open-ended answers across modalities.
Figure 5: Overall validation reward vs. training step across reward designs. Training with a single signal (LLM-only or embedding-only) shows volatility and reward hacking, while LLM+embedding reduces but does not eliminate instability. MediX-R1 uses a composite reward, which stabilizes learning and delivers the highest final reward and best overall performance.
...and 2 more figures
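
The three-stage pipeline in the Figure 3 caption (generation, reference-based judging, score aggregation) can be sketched for the binary QA/MCQ case as follows. This is an illustrative outline under stated assumptions: `Example`, `evaluate`, and `accuracy` are hypothetical names, and the `generate` and `judge` callables stand in for vLLM inference and the LLM judge, neither of which is reproduced here. The BASE and MIMIC templates and the rubric-based scoring for long-form reports are not modeled.

```python
from typing import Callable, Iterable, NamedTuple


class Example(NamedTuple):
    """One benchmark item: a question and its reference answer."""
    question: str
    reference: str


def accuracy(verdicts: Iterable[str]) -> float:
    """Stage 3: aggregate binary YES/NO judgments into an accuracy score."""
    parsed = [v.strip().upper() == "YES" for v in verdicts]
    return sum(parsed) / len(parsed) if parsed else 0.0


def evaluate(
    examples: Iterable[Example],
    generate: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> float:
    """Run generation, judging, and scoring over a set of examples.

    `generate(question)` is a placeholder for model inference (stage 1).
    `judge(question, reference, prediction)` is a placeholder for the
    reference-based LLM judge and must return "YES" or "NO" (stage 2).
    """
    verdicts = []
    for example in examples:
        prediction = generate(example.question)
        verdicts.append(judge(example.question, example.reference, prediction))
    return accuracy(verdicts)
```

Passing the generator and judge as callables keeps the scoring logic independent of any particular inference backend; in a test, both can be replaced by simple stubs.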
