When Does RL Help Medical VLMs? Disentangling Vision, SFT, and RL Gains

Ahmadreza Jeddi; Kimia Shaban; Negin Baghbanzadeh; Natasha Sharan; Abhishek Moturu; Elham Dolatabadi; Babak Taati

TL;DR

A controlled study disentangling effects along three axes (vision, SFT, and RL) finds that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective.

Abstract

Reinforcement learning (RL) is increasingly used to post-train medical Vision-Language Models (VLMs), yet it remains unclear whether RL improves medical visual reasoning or mainly sharpens behaviors already induced by supervised fine-tuning (SFT). We present a controlled study that disentangles these effects along three axes: vision, SFT, and RL. Using MedMNIST as a multi-modality testbed, we probe visual perception by benchmarking VLM vision towers against vision-only baselines, quantify reasoning support and sampling efficiency via Accuracy@1 versus Pass@K, and evaluate when RL closes the support gap and how gains transfer across modalities. We find that RL is most effective when the model already has non-trivial support (high Pass@K): it primarily sharpens the output distribution, improving Acc@1 and sampling efficiency, while SFT expands support and makes RL effective. Based on these findings, we propose a boundary-aware recipe and instantiate it by RL post-training an OctoMed-initialized model on a small, balanced subset of PMC multiple-choice VQA, achieving strong average performance across six medical VQA benchmarks.
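To make the Acc@1 versus Pass@K distinction concrete, the sketch below computes both from per-question sample counts. It is an illustration under stated assumptions, not the paper's evaluation code: it uses the standard unbiased combinatorial Pass@K estimator, treats Acc@1 as the mean correctness of a single sampled answer, and the names `pass_at_k` and `summarize` are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct),
    given that c of n drawn samples were correct.

    Assumption: this is the commonly used combinatorial estimator;
    the paper may compute Pass@K differently.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(correct_counts: list[int], n: int, k: int) -> tuple[float, float]:
    """Return (Acc@1, Pass@K) averaged over questions.

    correct_counts[i] is the number of correct answers among the n samples
    drawn for question i (hypothetical input format).
    """
    acc_at_1 = sum(c / n for c in correct_counts) / len(correct_counts)
    pass_k = sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
    return acc_at_1, pass_k


# Four questions, 16 samples each: one always solved, two rarely, one never.
acc1, p16 = summarize([16, 4, 0, 1], n=16, k=16)
print(acc1, p16)
```

In this toy input the model answers three of four questions correctly at least once (high Pass@16) but is right on a single draw only about a third of the time. A large gap of this kind is the regime where, according to the abstract, RL mainly helps by sharpening the output distribution.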

Paper Structure (19 sections, 1 equation, 3 figures, 3 tables)

Introduction
Contributions.
Related Work
Medical VLMs and post-training.
Does RL add reasoning beyond the base model?
Disentangling Vision, SFT, and RL in Medical VLMs
Testbed.
Models.
RQ1: How Strong Are the Visual Representations in Medical VLMs?
RQ2: What Is the Reasoning-Capacity of Medical VLMs?
RQ3: When Does RL Help Medical VLMs?
Setup.
Results.
From Analysis to Practice: A Recipe for RL Post-Training
Support and sharpening.
...and 4 more sections

Figures (3)

Figure 1: Pass@K curves on MedMNIST-v2, grouped by modality.
Figure 2: Before/after RL changes in Acc@1 and Pass@16 from $M_{\text{Base}}$ and $M_{\text{SFT}}$ across in-domain, within-modality, and cross-modality evaluations.
Figure 3: Overview of our boundary-aware recipe. We first diagnose support using Pass@K and Acc@1, then decide between bridging and RL sharpening.
