Elicit and Enhance: Advancing Multimodal Reasoning in Medical Scenarios
Zhongzhen Huang, Linjie Mu, Yakun Zhu, Xiangyu Zhao, Shaoting Zhang, Xiaofan Zhang
TL;DR
This paper tackles multimodal reasoning in clinical medicine by proposing MedE$^2$, a two-stage post-training pipeline. Stage-I elicits reasoning behavior by fine-tuning on text-only reasoning demonstrations; Stage-II enhances reasoning quality by aligning model outputs with a proposed Multimodal Medical Reasoning Preference (MMRP) via Direct Preference Optimization (DPO). Training uses a curated dataset of about 3.5K high-quality samples (2,000 text-only for Stage-I, 1,500 multimodal for Stage-II). Across multiple medical benchmarks, Stage-I yields consistent gains and Stage-II adds further improvements, enabling open-source models to rival larger proprietary systems; the gains hold under inference-time scaling. The approach emphasizes data quality, structured reasoning, and preference-based alignment to reduce hallucinations. Results generalize across model sizes, and open resources are provided for reproducibility and extension.
Abstract
Effective clinical decision-making depends on iterative, multimodal reasoning across diverse sources of evidence. The recent emergence of multimodal reasoning models has significantly transformed the landscape of solving complex tasks. Although such models have achieved notable success in mathematics and science, their application to medical domains remains underexplored. In this work, we propose MedE$^2$, a two-stage post-training pipeline that elicits and then enhances multimodal reasoning for medical domains. In Stage-I, we fine-tune models using 2,000 text-only data samples containing precisely orchestrated reasoning demonstrations to elicit reasoning behaviors. In Stage-II, we further enhance the model's reasoning capabilities using 1,500 rigorously curated multimodal medical cases, aligning model reasoning outputs with our proposed multimodal medical reasoning preference. Extensive experiments demonstrate the efficacy and reliability of MedE$^2$ in improving the reasoning performance of medical multimodal models. Notably, models trained with MedE$^2$ consistently outperform baselines across multiple medical multimodal benchmarks. Additional validation on larger models and under inference-time scaling further confirms the robustness and practical utility of our approach.
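
The Stage-II preference alignment is attributed to Direct Preference Optimization (DPO) in the summary above. As a rough illustration of what that objective computes for a single preference pair, the sketch below implements the standard DPO loss, $-\log \sigma\big(\beta\,[(\log \pi_\theta(y_w|x) - \log \pi_{ref}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{ref}(y_l|x))]\big)$, in plain Python. The function names, the value of beta, and the example log-probabilities are assumptions made for illustration; the paper's actual hyperparameters and training code are not given in this summary.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a full response
    under either the policy being trained or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the policy favors the chosen response more
    # strongly than the reference does.
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Hypothetical log-probabilities, for illustration only.
baseline = dpo_loss(0.0, 0.0, 0.0, 0.0)            # policy identical to reference
improved = dpo_loss(-10.0, -20.0, -12.0, -15.0)    # policy favors the chosen response
```

When the policy matches the reference, the loss equals $\log 2$; it drops below that value once the policy assigns relatively more probability to the preferred reasoning trace than the reference does. In this setting, the chosen and rejected responses would be reasoning outputs ranked according to the proposed multimodal medical reasoning preference.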
