RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning

Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Chih-Ho Hsu, Li-Hung Yao, Chun-Chieh Liao, Feng Liu, Fang-Ming Hung

TL;DR

The paper addresses the unreliability of clinical reasoning in medical LLMs by introducing Ranked Preference Reinforcement Optimization (RPRO), a framework that combines task-adaptive chain-of-thought templates with probabilistic, multi-dimensional quality assessment and groupwise ranking via the Bradley–Terry model. RPRO generates multiple CoT candidates, evaluates them on coverage, factual accuracy, and redundancy, and optimizes over full rankings with linear rewards, coupled with KL regularization to stabilize training. Evaluations on PubMedQA, MedQA-USMLE, and FEMH show that a 2B Gemma-based model with RPRO matches or surpasses much larger models across 0-, 1-, and 5-shot settings, with notable gains in macro F1 and semantic coherence. The findings suggest that quality-driven reasoning refinement can yield clinically grounded, scalable LLMs for medical QA and diagnostic support, beyond mere parameter scaling.

Abstract

Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO distinguishes itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns model outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley–Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA, MedQA-USMLE, and a real-world clinical dataset from Far Eastern Memorial Hospital (FEMH) demonstrate consistent improvements over strong baselines. Remarkably, our 2B-parameter model outperforms much larger 7B–20B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement provides a scalable and clinically grounded approach to building more reliable medical LLMs.
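
To make the groupwise ranking idea concrete, the following is a minimal sketch, not the paper's actual objective. It assumes that a full ranking is scored by summing Bradley–Terry pairwise terms over every (higher-ranked, lower-ranked) pair, and that the KL-divergence term is computed elsewhere and added with a weight $\beta$. The function names, the averaging over pairs, and the default `beta=0.1` are all illustrative assumptions.

```python
import math
from itertools import combinations


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_ranking_loss(scores_best_first):
    """Average negative log-likelihood of a full ranking under Bradley-Terry.

    scores_best_first holds one model score per candidate, ordered from the
    best-ranked candidate to the worst. Under Bradley-Terry,
    P(i beats j) = sigmoid(s_i - s_j).
    Assumption: the ranking likelihood is decomposed into all ordered pairs.
    """
    pairs = list(combinations(range(len(scores_best_first)), 2))
    if not pairs:
        return 0.0  # a single candidate carries no ranking signal
    total = 0.0
    for i, j in pairs:  # i < j, so candidate i is ranked above candidate j
        total -= log_sigmoid(scores_best_first[i] - scores_best_first[j])
    return total / len(pairs)


def total_loss(scores_best_first, kl_estimate, beta=0.1):
    """Ranking loss plus a KL penalty weighted by beta (hypothetical default)."""
    return bradley_terry_ranking_loss(scores_best_first) + beta * kl_estimate
```

When the model's scores agree with the ranking, the loss falls below $\log 2$; when every candidate receives the same score, it equals $\log 2$ exactly. A larger `beta` keeps the policy closer to its reference model at the cost of a weaker ranking signal.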

Paper Structure

This paper contains 31 sections, 15 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of the proposed pipeline. The framework distinguishes between medical QA and diagnosis tasks, applies chain-of-thought (CoT) reasoning with multi-dimensional quality assessment and probabilistic refinement, and further improves through ranked preference reinforcement optimization (RPRO) with dataset construction.
  • Figure 2: Performance comparison on PubMedQA, MedQA-USMLE, and FEMH under different $\beta$ values. PubMedQA/MedQA report Accuracy (solid blue) and Macro F1 (dashed orange), while FEMH reports BERTScore-F1 (solid blue) and Cosine Similarity (dashed orange).
  • Figure 3: Performance comparison on PubMedQA, MedQA-USMLE, and FEMH under different rollout numbers ($K$). PubMedQA/MedQA report Accuracy and Macro F1; FEMH reports BERTScore-F1 and Cosine Similarity. Each prompt generates $K$ candidate CoTs. For $K<4$, all candidates are used ($M=K$); for $K\ge4$, the top $M=4$ candidates are selected for training.
  • Figure 4: Performance across different acceptance thresholds on PubMedQA, MedQA-USMLE, and FEMH. PubMedQA/MedQA report Accuracy (solid blue) and Macro F1 (dashed purple), while FEMH reports BERTScore-F1 and Cosine Similarity.
  • Figure 5: Training loss curves on MedQA-USMLE, PubMedQA, and FEMH. The plots show KL divergence, pairwise loss, ranking loss, and total loss across training steps for each dataset.
  • ...and 3 more figures
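
The candidate-selection rule in the Figure 3 caption (use all $K$ candidates when $K<4$, otherwise keep the top $M=4$) can be sketched as follows. This is an illustrative reading only: it assumes "top" means highest quality score, and the function name and its parameters are hypothetical.

```python
def select_for_training(candidates, quality_scores, max_kept=4):
    """Return at most max_kept candidates, best quality score first.

    With K < max_kept candidates, all K are returned (M = K);
    with K >= max_kept, only the top max_kept are returned (M = 4 by default).
    Assumption: candidates are ranked by a single scalar quality score.
    """
    ranked = sorted(
        zip(candidates, quality_scores),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [candidate for candidate, _ in ranked[:max_kept]]
```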