Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios

Zhongzhen Huang, Linjie Mu, Yakun Zhu, Xiangyu Zhao, Shaoting Zhang, Xiaofan Zhang

TL;DR

This paper tackles the challenge of multimodal reasoning in clinical medicine by proposing MedE$^2$, a two-stage post-training pipeline that first elicits reasoning with text-only demonstrations and then enhances reasoning quality through Multimodal Medical Reasoning Preference (MMRP) and Direct Preference Optimization (DPO). A curated dataset of about 5K high-quality samples (3K text, 2K multimodal) supports Stage-I and Stage-II training. Empirical results across multiple medical benchmarks show that Stage-I yields consistent gains, while Stage-II yields further improvements, enabling open-source models to rival larger proprietary systems and remaining robust under inference-time scaling. The approach emphasizes data quality, structured reasoning, and preference-based alignment to reduce hallucinations, with results generalizing across model sizes and providing open resources for reproducibility and extension.

Abstract

Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE$^2$, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE$^2$ in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE$^2$ consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.
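
For readers unfamiliar with the preference-alignment step in Stage-II, the following is a minimal sketch of the standard Direct Preference Optimization loss for a single preference pair. It is an illustration only, not the authors' implementation. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions. This summary does not specify how the multimodal medical reasoning preference selects the preferred and rejected responses, so they appear here simply as precomputed log-probabilities.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    `beta` (hypothetical default) scales the implicit reward.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
    # Policy shifted toward the chosen response: the loss drops below log(2).
    print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the policy raises the likelihood of the preferred response relative to the rejected one, measured against the reference model. In a pipeline like the one described, that reference would plausibly be the Stage-I model, though this summary does not say so.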

Paper Structure

This paper contains 19 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: (a) Current models' performance on diverse tasks. (b) Samples that closely mirror real-world clinical scenarios are used to strengthen multimodal reasoning capabilities instead of samples focused primarily on pattern recognition or basic knowledge recall.
  • Figure 2: Overview of the two-stage post-training recipe MedE$^2$. In Stage-I, text-only data containing reasoning demonstrations is employed to elicit initial reasoning behavior. In Stage-II, Direct Preference Optimization is applied to multimodal data to further enhance reasoning quality.
  • Figure 3: Distribution of human–model score differences, with 68.9% falling within $\pm\sigma$, where $\sigma = 1.02$.
  • Figure 4: Performance of Qwen2.5-VL-7B, 32B, and 72B on the MedXpertQA-MM benchmark. As model size increases, the models benefit progressively more from MedE$^2$.
  • Figure 5: Comparison of performance on MedXpertQA-MM using various strategies for eliciting reasoning behaviors.
  • ...and 4 more figures