Stabilizing Reasoning in Medical LLMs with Continued Pretraining and Reasoning Preference Optimization
Wataru Kawakami, Keita Suzuki, Junichiro Iwasawa
TL;DR
This work tackles unreliable reasoning in medical LLMs by presenting Preferred-MedLLM-Qwen-72B, a Japanese medical LLM produced by a two-stage fine-tuning pipeline on the Qwen2.5-72B base: Continued Pretraining (CPT) followed by Reasoning Preference Optimization (RPO). CPT embeds deep domain knowledge from a Japanese medical corpus, while RPO optimizes reasoning quality and stability by enforcing a preference hierarchy that favors ground-truth explanations over other reasoning paths. The model achieves a state-of-the-art IgakuQA accuracy of 0.868, surpassing GPT-4o, and maintains this level when explicitly prompted to explain its reasoning, a significant improvement in explanation stability. Ablation studies show that CPT provides the knowledge gains and RPO provides the stability, with generalization benefits observed on additional Japanese medical benchmarks. The approach suggests a practical pathway for building trustworthy, explanation-stable LLMs for high-stakes, non-English medical contexts.
Abstract
Large Language Models (LLMs) show potential in medicine, yet clinical adoption is hindered by concerns over factual accuracy, language-specific limitations (e.g., Japanese), and, critically, their reliability when required to generate reasoning explanations, which is a prerequisite for trust. This paper introduces Preferred-MedLLM-Qwen-72B, a 72B-parameter model optimized for the Japanese medical domain to achieve both high accuracy and stable reasoning. We employ a two-stage fine-tuning process on the Qwen2.5-72B base model. First, Continued Pretraining (CPT) on a comprehensive Japanese medical corpus instills deep domain knowledge. Second, Reasoning Preference Optimization (RPO), a preference-based method, enhances the generation of reliable reasoning pathways while preserving high answer accuracy. Evaluations on the Japanese Medical Licensing Exam benchmark (IgakuQA) show that Preferred-MedLLM-Qwen-72B achieves state-of-the-art performance (0.868 accuracy), surpassing strong proprietary models such as GPT-4o (0.866). Crucially, unlike the baseline and CPT-only models, which exhibit significant accuracy degradation (up to 11.5% and 3.8%, respectively, on IgakuQA) when prompted for explanations, our model maintains its high accuracy (0.868) under such conditions. This highlights RPO's effectiveness in stabilizing reasoning generation and underscores the importance of optimizing for reliable explanations alongside accuracy. We release the Preferred-MedLLM-Qwen-72B model weights to foster research into trustworthy LLMs for specialized, high-stakes applications.
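The abstract describes RPO only as a preference-based method that favors ground-truth explanations, without stating its objective. As a rough illustration of what such an objective can look like, the sketch below implements a generic DPO-style preference loss combined with a length-normalized negative log-likelihood term on the preferred explanation. This is an assumption for illustration, not the authors' stated loss: the function name `preference_loss`, the hyperparameters `beta` and `alpha`, and the use of a frozen reference model are all hypothetical choices.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    chosen_len: int,
    beta: float = 0.1,   # hypothetical preference temperature
    alpha: float = 1.0,  # hypothetical weight of the NLL term
) -> float:
    """Illustrative preference loss for one (chosen, rejected) explanation pair.

    Inputs are summed token log-probabilities of each explanation under the
    model being trained ("policy") and under a frozen reference model.
    """
    # How much more the policy favors the chosen explanation than the
    # reference model does, scaled by beta.
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    dpo_term = -log_sigmoid(margin)
    # Length-normalized NLL keeps the likelihood of the chosen explanation high.
    nll_term = -policy_chosen_logp / chosen_len
    return dpo_term + alpha * nll_term
```

In this sketch, raising the policy's log-probability of the preferred (for example, ground-truth) explanation lowers both terms, while the rejected explanation enters only through the margin.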
