RPRO: Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning
Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Chih-Ho Hsu, Li-Hung Yao, Chun-Chieh Liao, Feng Liu, Fang-Ming Hung
TL;DR
The paper addresses the unreliability of clinical reasoning in medical LLMs by introducing Ranked Preference Reinforcement Optimization (RPRO), a framework that combines task-adaptive chain-of-thought templates with probabilistic, multi-dimensional quality assessment and groupwise ranking via the Bradley–Terry model. RPRO generates multiple CoT candidates, evaluates them on coverage, factual accuracy, and redundancy, and optimizes over full rankings with linear rewards, coupled with KL regularization to stabilize training. Evaluations on PubMedQA, MedQA-USMLE, and FEMH show that a 2B Gemma-based model with RPRO matches or surpasses much larger models across 0-, 1-, and 5-shot settings, with notable gains in macro F1 and semantic coherence. The findings suggest that quality-driven reasoning refinement can yield clinically grounded, scalable LLMs for medical QA and diagnostic support, beyond mere parameter scaling.
Abstract
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO distinguishes itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns model outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley–Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA, MedQA-USMLE, and a real-world clinical dataset from Far Eastern Memorial Hospital (FEMH) demonstrate consistent improvements over strong baselines. Remarkably, our 2B-parameter model outperforms much larger 7B–20B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement provides a scalable and clinically grounded approach to building more reliable medical LLMs.
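To make the groupwise ranking idea concrete, the sketch below shows one common way to turn a full ranking of CoT candidates into a Bradley–Terry style loss and add a KL penalty. It is an illustration under stated assumptions, not the paper's exact objective: summing pairwise Bradley–Terry log-likelihoods over all ordered pairs, the sample-based KL estimate, the default `beta`, and all function names are hypothetical choices made here.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def bt_groupwise_loss(scores):
    """Mean negative log-likelihood of a full ranking under pairwise Bradley-Terry.

    `scores` are model scores listed from the best-ranked candidate to the
    worst-ranked one. Assumption: the group loss is the average over all
    ordered pairs (i ranked above j) of -log sigmoid(s_i - s_j).
    """
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            # -log sigmoid(d) equals softplus(-d)
            total += softplus(-(scores[i] - scores[j]))
            n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


def kl_estimate(policy_logprobs, reference_logprobs):
    """Sample estimate of KL(policy || reference) from tokens drawn from the policy."""
    diffs = [p - r for p, r in zip(policy_logprobs, reference_logprobs)]
    return sum(diffs) / len(diffs) if diffs else 0.0


def ranking_plus_kl_loss(scores, policy_logprobs, reference_logprobs, beta=0.1):
    """Ranking loss plus a KL penalty; `beta` is a hypothetical weight."""
    return bt_groupwise_loss(scores) + beta * kl_estimate(
        policy_logprobs, reference_logprobs
    )


if __name__ == "__main__":
    # Scores that agree with the ranking give a lower loss than reversed scores.
    print(bt_groupwise_loss([2.0, 1.0, 0.0]))
    print(bt_groupwise_loss([0.0, 1.0, 2.0]))
```

Under these assumptions, the loss falls as the model assigns higher scores to better-ranked candidates, while the KL term discourages the policy from drifting far from the reference model during training.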
