InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning
Zeyu Liu, Zhitian Hou, Guanghao Zhu, Zhijie Sang, Congkai Xie, Hongxia Yang
TL;DR
This work tackles two core problems in medical multimodal LLMs: scarce, information-sparse multimodal medical data and the uncertain benefits of RLVR in medical tasks. It introduces the InfiMed-Series, which combines a low-resource reflective supervised fine-tuning (SFT) stage that fuses general multimodal data, medical textual data, and reflective-pattern-injected CoT data with a subsequent Reinforcement Learning with Verifiable Rewards (RLVR) stage using Group Relative Policy Optimization (GRPO). The approach yields InfiMed-SFT-3B and InfiMed-RL-3B, which achieve state-of-the-art performance among 3B models across seven medical benchmarks (average accuracies of 57.1% and 59.2%, respectively), with RLVR adding 2.1 points over SFT. Key findings include the importance of diverse data for SFT, the value of reflective CoT when information is sparse, and the context-dependent effects of explicit reasoning prompts on medical tasks. Overall, the work demonstrates a data-efficient pathway to high-performing medical MLLMs and offers practical insights into training regimes and reasoning strategies for clinical multimodal AI.
Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable progress in domains such as visual understanding and mathematical reasoning. However, their application in the medical domain is constrained by two key challenges: (1) multimodal medical datasets are scarce and often contain sparse information, limiting reasoning depth; and (2) Reinforcement Learning with Verifiable Rewards (RLVR), though effective in general domains, cannot reliably improve model performance in the medical domain. To overcome these challenges, during the supervised fine-tuning (SFT) stage we incorporate high-quality textual reasoning data and general multimodal data alongside multimodal medical data to efficiently enhance foundational medical capabilities and restore the base model's reasoning ability. Moreover, because some multimodal medical datasets contain sparse information, we synthesize reflective-pattern-injected chain-of-thought (CoT) samples in addition to general CoT samples, equipping the model with initial reflective reasoning capabilities that provide a structured foundation for subsequent RLVR training. Finally, we introduce our InfiMed-Series models, InfiMed-SFT-3B and InfiMed-RL-3B, both of which deliver state-of-the-art performance across seven multimodal medical benchmarks. Notably, InfiMed-RL-3B achieves an average accuracy of 59.2%, outperforming even larger models such as InternVL3-8B, which achieves 57.3%. The SFT phase uses 188K samples and the RLVR phase uses 36K samples, demonstrating the efficacy of both training strategies in achieving superior performance. We also conduct extensive experiments that provide insights for advancing the performance of MLLMs in medical scenarios.
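To make the RLVR stage more concrete, the following is a minimal sketch of the group-relative advantage computation commonly associated with GRPO, paired with a simple rule-based verifiable reward. It is an illustration under stated assumptions, not the authors' implementation: the function names, the exact-match reward, the use of the population standard deviation, and the toy responses are all hypothetical.

```python
from statistics import mean, pstdev


def verifiable_reward(prediction: str, answer: str) -> float:
    """Hypothetical rule-based reward: 1.0 on a normalized exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of responses sampled for the same prompt.

    Assumption: population standard deviation, with eps added to avoid
    division by zero when every response receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: four sampled answers to one multiple-choice question whose
# reference answer is "B" (illustrative data only).
responses = ["B", "b ", "C", "A"]
rewards = [verifiable_reward(r, "B") for r in responses]
advantages = group_relative_advantages(rewards)

# Degenerate group: identical rewards carry no learning signal.
flat_advantages = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In this sketch, correct responses receive positive advantages and incorrect ones negative advantages relative to their own group, so no separate value model is needed. A group in which every response earns the same reward yields all-zero advantages, which is one reason the difficulty mix of RLVR samples can matter.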
