PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks

Jianyu Wu, Hao Yang, Xinhua Zeng, Guibing He, Zhiyu Chen, Zihui Li, Xiaochuan Zhang, Yangyang Ma, Run Fang, Yang Liu

TL;DR

PathVLM-R1 targets interpretable reasoning in pathology visual-language tasks by combining domain-knowledge infusion through supervised fine-tuning with a dual-reward reinforcement-learning stage. Training uses Group Relative Policy Optimization (GRPO), which enables efficient, critic-free policy updates, while a cross-modal process reward supervises the reasoning chain alongside an outcome accuracy reward. Empirical results show notable in-domain gains (65.55% accuracy) and strong out-of-domain generalization, including substantial improvements in dermoscopy transfer, and the model outperforms the much larger Qwen2.5-VL-32B. This work advances reliable, explainable AI for pathology and lays groundwork for broader multi-modality medical imaging applications in precision medicine.

Abstract

The diagnosis of pathological images is often limited by expert availability and regional disparities, highlighting the importance of automated diagnosis using Vision-Language Models (VLMs). Traditional multimodal models typically emphasize outcomes over the reasoning process, compromising the reliability of clinical decisions. To address the weak reasoning abilities and lack of process supervision in pathological VLMs, we propose PathVLM-R1, a visual language model designed specifically for pathological images. We build the model on Qwen2.5-VL-7B-Instruct and enhance its performance on pathological tasks through carefully designed post-training strategies. First, we conduct supervised fine-tuning on pathological data to give the model foundational pathological knowledge, forming a new pathological base model. Next, we introduce Group Relative Policy Optimization (GRPO) and propose a dual reward-driven reinforcement learning optimization, in which a cross-modal process reward supervises the logic of the reasoning process and an outcome accuracy reward supervises the correctness of the result. On pathological image question-answering tasks, PathVLM-R1 achieves a 14% improvement in accuracy over baseline methods and outperforms the Qwen2.5-VL-32B version despite having significantly fewer parameters. Furthermore, in out-of-domain evaluation on four medical imaging modalities, namely Computed Tomography (CT), dermoscopy, fundus photography, and Optical Coherence Tomography (OCT), PathVLM-R1's transfer performance improves by an average of 17.3% compared to traditional SFT methods. These results indicate that PathVLM-R1 not only improves accuracy but also has broad applicability and potential for extension.
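The critic-free character of GRPO mentioned in the abstract comes from scoring each sampled response against the other responses drawn for the same question, rather than against a learned value function. The following is a minimal sketch of that group-relative normalization only, not the authors' training code. The function name, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    `rewards` holds one scalar reward per response sampled for the SAME
    question. The group mean acts as the baseline, so no critic is needed.
    `eps` (an assumed constant) guards against division by zero when every
    response in the group receives the same reward.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one question.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below the mean receive a negative one. When all rewards in a group are equal, every advantage is zero and that group contributes no learning signal.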

Paper Structure

This paper contains 11 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison of the final PathVLM-R1 model and its variant PathVLM-$\beta$. For the same pathology image question-answering task, PathVLM-R1, which incorporates cross-modal procedural rewards, demonstrates greater rigor in reasoning and accuracy in medical knowledge than PathVLM-$\beta$, which uses only accuracy and format rewards. In this example, PathVLM-$\beta$ focuses only on differences in nuclear size and neglects that "pleomorphism" covers multiple aspects of variation, such as nuclear shape and structure. In contrast, PathVLM-R1 integrates multiple nuclear features and accurately captures the definition of "pleomorphism" with a more comprehensive and rigorous thought process.
  • Figure 2: Pipeline for cross-modal procedural loss. First, the old policy is sampled to obtain the model's generated thought process and final answer. These outputs, together with the question and the evaluation criteria, are passed to GPT-4o, which reviews them for reasoning completeness and knowledge correctness and returns an Integrity score and a Knowledge score. Finally, GPT-4o's feedback is post-processed (error handling and score normalization), and the average of the Integrity and Knowledge scores is used as the final reward.
  • Figure 3: Model variants generated during stage-wise training. The entire training set is divided into three disjoint segments of 3000, 1000, and 1385 samples, used respectively for supervised fine-tuning, for reinforcement learning and its control group, and for final performance testing. Starting from Qwen2.5-VL-7B, model Alpha is obtained by supervised fine-tuning on 3000 samples. Delta and Epsilon are obtained from the base model using 1000 samples for supervised fine-tuning and reinforcement learning, respectively. Gamma and Beta are obtained by continuing from Alpha with an additional 1000 samples of supervised fine-tuning and reinforcement learning, respectively. Finally, adding cross-modal procedural losses to Beta yields the final model, PathVLM-R1.
  • Figure 4: Changes in model accuracy across different optimization stages. The model's performance is significantly improved across all three training steps.
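
The post-processing step described in the Figure 2 caption (error handling, score normalization, then averaging the Integrity and Knowledge scores) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation. The JSON reply format, the key names `integrity` and `knowledge`, the 0 to 10 scoring scale, and the fallback reward of 0.0 are all hypothetical, and the call to GPT-4o itself is omitted.

```python
import json
import math

MAX_SCORE = 10.0  # assumed upper bound of the judge's scoring scale


def process_reward(judge_reply, max_score=MAX_SCORE, fallback=0.0):
    """Turn a judge reply into a scalar reward in [0, 1].

    Assumes the judge answers with JSON such as
    {"integrity": 8, "knowledge": 6}. Any malformed reply yields `fallback`.
    """
    try:
        parsed = json.loads(judge_reply)
        integrity = float(parsed["integrity"])
        knowledge = float(parsed["knowledge"])
    except (ValueError, KeyError, TypeError):
        # Error handling: unparsable text, missing keys, or non-numeric values.
        return fallback
    if not (math.isfinite(integrity) and math.isfinite(knowledge)):
        return fallback

    def normalize(score):
        # Clamp to the assumed scale, then rescale to [0, 1].
        return min(max(score, 0.0), max_score) / max_score

    # Final reward: average of the two normalized scores.
    return (normalize(integrity) + normalize(knowledge)) / 2.0


# Hypothetical judge replies.
good = process_reward('{"integrity": 8, "knowledge": 6}')
bad = process_reward("Sorry, I cannot score this response.")
```

Returning a fixed fallback for malformed replies keeps a single bad judge response from interrupting training, at the cost of treating it as a low-quality sample. Whether the original pipeline does this, retries the request, or discards the sample is not stated in the caption.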