Toward Effective Reinforcement Learning Fine-Tuning for Medical VQA in Vision-Language Models

Wenhui Zhu, Xuanzhao Dong, Xin Li, Peijie Qiu, Xiwen Chen, Abolfazl Razi, Aris Sotiras, Yi Su, Yalin Wang

TL;DR

This work evaluates GRPO-based reinforcement learning for medical multimodal language models in the context of visual question answering. It systematically analyzes initialization strategies (scratch vs instruction-tuned), medical semantic alignment, long-chain reasoning incentives, and normalization biases, showing that semantic alignment and unbiased GRPO yield meaningful performance gains over standard supervised fine-tuning. Key findings indicate that purely length-based rewards can degrade factual accuracy, while unbiased GRPO and medical semantic rewards improve both accuracy and clinically grounded reasoning. The study suggests that GRPO-based RL holds promise for developing more trustworthy and efficient medical MLLMs, with practical implications for clinical VQA systems and beyond.

Abstract

Recently, reinforcement learning (RL)-based tuning has shifted the trajectory of Multimodal Large Language Models (MLLMs), particularly following the introduction of Group Relative Policy Optimization (GRPO). However, achieving clinically grounded model behavior by applying GRPO directly to medical tasks remains challenging. Motivated by the need to align model responses with clinical expectations, we investigate four critical dimensions that affect the effectiveness of RL-based tuning in medical visual question answering (VQA): base model initialization strategy, the role of medical semantic alignment, the impact of length-based rewards on long-chain reasoning, and the influence of bias. We conduct extensive experiments to analyze these factors for medical MLLMs, providing new insights into domain-specific fine-tuning of these models. Our results also demonstrate that GRPO-based RL tuning consistently outperforms standard supervised fine-tuning (SFT) in both accuracy and reasoning quality.
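
For readers unfamiliar with GRPO, the group-relative advantage at its core can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `group_relative_advantages`, the `normalize_std` flag, and the use of the population standard deviation are assumptions made for this example. The variant without the standard-deviation division is one commonly discussed way to reduce normalization bias; whether it matches the "unbiased GRPO" studied in the paper is also an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=True, eps=1e-8):
    """Advantages for one group of responses sampled for the same prompt.

    Each reward is compared against the group mean, so no learned value
    function is needed. With normalize_std=True the centered rewards are
    also divided by the group's standard deviation (population std here,
    which is an assumption of this sketch).
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        # Hypothetical "unbiased" variant: keep only the mean-centering.
        return centered
    scale = pstdev(rewards) + eps  # eps guards against all-equal rewards
    return [c / scale for c in centered]


# Four sampled answers to one question; reward 1.0 marks a correct answer.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
print(group_relative_advantages(rewards, normalize_std=False))
```

In both variants, responses that score above the group mean receive a positive advantage and those below receive a negative one; the two differ only in how the magnitude is scaled.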

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Illustration of the prompt template used to evaluate the effectiveness of medical semantic alignment. See the section on medical semantic alignment for more details.
  • Figure S2: Visual comparison of reasoning outputs on two medical imaging questions. Red highlights indicate incorrect answers, while green highlights indicate correct answers. Although the model obtained by GRPO-based RL tuning of Qwen2-VL-2B from scratch generates longer sequences, its reasoning is often redundant and inaccurate. GRPO-based RL tuning starting from Qwen2-VL-2B-Instruct produces more concise and clinically accurate reasoning, leading to correct answers.
  • Figure S3: Examples showing that medical alignment improves visual reasoning. Correct answers are shown in green, incorrect in red, and medical knowledge is highlighted in yellow. With medical alignment, the model produces more accurate and informed responses by grounding its reasoning in domain-specific knowledge.
  • Figure S4: Examples of incorrect but verbose reasoning in long-chain answers. Although the model generates extensive intermediate thinking steps, the reasoning is often repetitive, includes irrelevant details, and ultimately leads to an incorrect answer.
  • Figure S5: Comparison between the original Qwen2-VL-2B-Instruct and its LoRA fine-tuned variant. While the original model generates step-by-step visual reasoning to support its prediction, the LoRA-SFT version directly outputs the answer without any intermediate explanation.