Improving Medical Visual Reinforcement Fine-Tuning via Perception and Reasoning Augmentation

Guangjing Yang; ZhangYuan Yu; Ziyuan Qin; Xinyuan Song; Huahui Yi; Qingbo Kang; Jun Gao; Yiyue Li; Chenlin Du; Qicheng Lao

TL;DR

The paper tackles the challenge of extending visual reinforcement fine-tuning (V-RFT) to medical imaging by proposing VRFT-Aug, a framework that augments both perception and reasoning in medical large vision-language models (LVLMs). Perception is augmented explicitly, by adding task-relevant context to prompts, and implicitly, by transferring cross-task localization knowledge into the policy model. Reasoning is augmented through recitation-based reward shaping and a multi-grade fuzzy reward (MFRS) for ordinal classification. Empirical results across eight MedMNIST datasets show consistent improvements over supervised fine-tuning and baseline V-RFT, with notable gains from localization-informed perception and from MFRS in sparse-reward settings. The work offers practical training guidance and is positioned as a foundational step toward reliable, reasoning-enabled medical visual models with RL-based post-training.

Abstract

While recent advances in Reinforcement Fine-Tuning (RFT) have shown that rule-based reward schemes can enable effective post-training for large language models, their extension to cross-modal, vision-centric domains remains largely underexplored. This limitation is especially pronounced in the medical imaging domain, where effective performance requires both robust visual perception and structured reasoning. In this work, we address this gap by proposing VRFT-Aug, a visual reinforcement fine-tuning framework tailored for the medical domain. VRFT-Aug introduces a series of training strategies designed to augment both perception and reasoning, including prior knowledge injection, perception-driven policy refinement, medically informed reward shaping, and behavioral imitation. Together, these methods aim to stabilize and improve the RFT process. Through extensive experiments across multiple medical datasets, we show that our approaches consistently outperform both standard supervised fine-tuning and RFT baselines. Moreover, we provide empirically grounded insights and practical training heuristics that can be generalized to other medical image tasks. We hope this work contributes actionable guidance and fresh inspiration for the ongoing effort to develop reliable, reasoning-capable models for high-stakes medical applications.
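The multi-grade fuzzy reward for ordinal classification mentioned in the TL;DR can be pictured as graded partial credit: a prediction that misses the true grade by one step earns more than one that misses by several, so the policy receives a learning signal even when exact matches are rare. The sketch below is only an illustration of that general idea. The linear decay, the function names, and the `num_grades` parameter are assumptions made for this example and are not taken from the paper, whose exact reward definition is not given in this summary.

```python
def exact_match_reward(pred: int, target: int) -> float:
    """Sparse rule-based reward: 1 for the exact grade, 0 otherwise."""
    return 1.0 if pred == target else 0.0


def graded_ordinal_reward(pred: int, target: int, num_grades: int) -> float:
    """Hypothetical graded reward for ordinal labels 0..num_grades-1.

    Assumed form (not the paper's): credit decays linearly with the
    ordinal distance between the predicted and the true grade.
    """
    if num_grades < 2:
        raise ValueError("num_grades must be at least 2")
    # Out-of-range predictions (e.g. malformed model output) get no credit.
    if not (0 <= pred < num_grades and 0 <= target < num_grades):
        return 0.0
    return 1.0 - abs(pred - target) / (num_grades - 1)


if __name__ == "__main__":
    # Compare the sparse and graded signals for a 5-grade task, true grade 2.
    for pred in range(5):
        print(pred, exact_match_reward(pred, 2), graded_ordinal_reward(pred, 2, 5))
```

Under the exact-match reward, four of the five possible predictions are indistinguishable (all score 0), whereas the graded variant ranks them by how far they are from the true grade.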

Paper Structure (27 sections, 11 equations, 4 figures, 3 tables)

This paper contains 27 sections, 11 equations, 4 figures, 3 tables.

Introduction
Related Works
Large Vision Language Models
Reinforcement Learning
Methods
Preliminary
Augmenting Prompt $P$ with Task-Relevant Context
Augmenting Policy Model $\pi_{\theta}$ with Task-Relevant Knowledge
Augmenting Reward $R$ with Recitation Reasoning
Augmenting Reward $R$ with Multi-Grade Fuzzy Approach
Experiments
Setup
Experimental Results
Conclusion
Appendix
...and 12 more sections

Figures (4)

Figure 1: Overview of VRFT-Aug. VRFT-Aug incorporates enhancements from both Perception and Reasoning perspectives, introducing four improvement strategies for medical vision tasks: Augmenting Prompt ($PA_p$), Augmenting Policy Model ($PA_\pi$), Recitation Reasoning ($R_\text{recite}$), and Multi-Grade Fuzzy Reward ($R_\text{MFRS}$).
Figure 2: The effectiveness of our proposed perception augmentation on the prompt.
Figure 3: Performance comparison of different methods on the HAM10000 and HEEL datasets. (a) and (b) show that VRFT + $PA_\pi$ achieves the highest accuracy, with a +35.30% improvement on HAM10000. (c) demonstrates that the performance of VRFT + $PA_\pi$ improves with increasing training samples, reflecting enhanced perception capabilities. VSFT + $PA_\pi$ and VRFT + $PA_\pi$ are trained on bounding box prediction tasks (using SFT and GRPO, respectively) and evaluated on classification in a zero-shot manner, while V-SFT and V-RFT are directly trained for classification without localization.
Figure 4: Performance variation on BloodMNIST of different Recitation Reward settings.
