
Evaluating an evidence-guided reinforcement learning framework in aligning light-parameter large language models with decision-making cognition in psychiatric clinical reasoning

Xinxin Lin, Guangxin Dai, Yi Zhong, Xiang Li, Xue Xiao, Yixin Zhang, Zhengdong Wu, Yongbo Zheng, Runchuan Zhu, Ming Zhao, Huizi Yu, Shuo Wu, Jun Zhao, Lingming Hu, Yumei Wang, Ping Yin, Joey W. Y. Chan, Ngan Yin Chan, Sijing Chen, Yun Kwok Wing, Lin Lu, Xin Ma, Lizhou Fan

TL;DR

Problem: hallucinations and misalignment in psychiatry limit AI-assisted decision support, especially for privacy-preserving light-parameter LLMs. Approach: ClinMPO couples a policy model with ClinRM, a reward model trained on an Evidence Dataset organized by the Oxford Centre for Evidence-Based Medicine hierarchy, and multi-group policy optimization; the final scalar reward is $R = \max(0, R_{raw})$ with $R_{raw} = \sum_j s_j + s_{C2} + s_{C3}$. Findings: ClinMPO-enabled Qwen3-8B achieves 31.43% diagnostic accuracy, surpassing the medical student benchmark of 30.84%, and shows robust gains across 26 ICD-11 categories and 12 psychiatric competencies; mean accuracy gains across scales are about 2.72 percentage points, with notable net reasoning corrections (e.g., +98) at the 4B scale. Significance: demonstrates that explicit cognitive alignment via evidence-based RL enables lightweight LLMs to approach or exceed clinician-level diagnostic reasoning, enabling privacy-preserving, scalable psychiatric decision support, while outlining limitations and directions for future work such as pretraining alignment and deployment considerations.
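The reward clipping stated above can be illustrated with a minimal sketch. It assumes only the formula given in the TL;DR; the function name, the argument names, and the reading of $s_j$, $s_{C2}$, and $s_{C3}$ as plain floating-point component scores are hypothetical, since this summary does not define how each component is produced.

```python
from typing import Sequence


def clipped_reward(criterion_scores: Sequence[float], s_c2: float, s_c3: float) -> float:
    """Return R = max(0, R_raw) with R_raw = sum_j s_j + s_C2 + s_C3.

    All names here are illustrative assumptions; only the formula itself
    comes from the summary above.
    """
    # Sum the per-criterion scores s_j and add the two extra terms.
    r_raw = sum(criterion_scores) + s_c2 + s_c3
    # Clip at zero so the final scalar reward is never negative.
    return max(0.0, r_raw)


if __name__ == "__main__":
    print(clipped_reward([1.0, 0.5], -0.5, 0.0))
    print(clipped_reward([0.5], -1.0, -1.0))
```

Clipping at zero means that a response whose component scores sum to a negative value receives the same reward as one that sums to exactly zero.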

Abstract

Large language models (LLMs) hold transformative potential for medical decision support, yet their application in psychiatry remains constrained by hallucinations and superficial reasoning. This limitation is particularly acute in light-parameter LLMs, which are essential for privacy-preserving and efficient clinical deployment. Existing training paradigms prioritize linguistic fluency over structured clinical logic and result in a fundamental misalignment with professional diagnostic cognition. Here we introduce ClinMPO, a reinforcement learning framework designed to align the internal reasoning of LLMs with professional psychiatric practice. The framework employs a specialized reward model trained independently on a dataset derived from 4,474 psychiatry journal articles and structured according to evidence-based medicine principles. We evaluated ClinMPO on an unseen subset of the benchmark designed to isolate reasoning capabilities from rote memorization. This test set comprises items on which leading large-parameter LLMs consistently fail. We compared the performance of the ClinMPO-aligned light LLM against a cohort of 300 medical students. The ClinMPO-tuned Qwen3-8B model achieved a diagnostic accuracy of 31.4% and surpassed the human benchmark of 30.8% on these complex cases. These results demonstrate that medical evidence-guided optimization enables light-parameter LLMs to master complex reasoning tasks. Our findings suggest that explicit cognitive alignment offers a scalable pathway to reliable and safe psychiatric decision support.

Paper Structure (23 sections, 10 equations, 8 figures)


Figures (8)

  • Figure 1: Overview of the ClinMPO framework. a Data construction pipeline for the Public Dataset and the Evidence Dataset. b Illustration of the ClinMPO algorithm. Candidate responses are scored by a reward model (ClinRM) trained on the Evidence Dataset to mimic psychiatrist ratings. Group-based reward and advantage calculations are then used to optimize the policy. c Model performance is evaluated using a two-level clinical classification scheme and compared with human performance, using outputs from multiple models and medical students on a held-out test set.
  • Figure 2: Performance of medical students, base models, and fine-tuned models trained with different pipelines on the test set, stratified by the Two-tiered Clinical Categorization. Dots in different colors represent the accuracy of different models for each category, while the red diamond denotes the accuracy of medical students on the corresponding category set. a Results stratified by the ICD-11 diagnostic taxonomy. b Results stratified by psychiatric practice competencies.
  • Figure 3: Model accuracy and reasoning transition analysis across scales. a Net reasoning transitions (false-to-true minus true-to-false, FT - TF) for each model scale and training strategy. b Human (medical student) performance and overall diagnostic accuracy of Qwen3 models at four parameter scales (0.6B, 1.7B, 4B, and 8B) under the Base, SFT, GRPO, and ClinMPO training strategies.
  • Figure 4: Overall accuracy distribution across the Two-tiered Clinical Categorization. This chart compares the distribution of human (medical student) accuracy (in red) against the performance of models trained under different paradigms (shown in other colors).
  • Figure 5: Data construction pipeline. a Public QA Dataset construction. b Evidence Dataset construction.
  • ...and 3 more figures
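
The net reasoning transition metric named in the Figure 3 caption (false-to-true minus true-to-false, FT - TF) can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: the function name and the representation of per-item correctness as two aligned lists of booleans (before and after training) are assumptions.

```python
from typing import Sequence


def net_reasoning_transitions(base_correct: Sequence[bool],
                              tuned_correct: Sequence[bool]) -> int:
    """Return FT - TF for two aligned per-item correctness lists.

    FT counts items the base model got wrong and the tuned model got right;
    TF counts items the base model got right and the tuned model got wrong.
    """
    if len(base_correct) != len(tuned_correct):
        raise ValueError("both lists must cover the same test items")
    pairs = list(zip(base_correct, tuned_correct))
    ft = sum(1 for before, after in pairs if not before and after)
    tf = sum(1 for before, after in pairs if before and not after)
    return ft - tf


if __name__ == "__main__":
    # Hypothetical toy data, not results from the paper.
    base = [False, False, True, True, False]
    tuned = [True, True, False, True, False]
    print(net_reasoning_transitions(base, tuned))
```

Under this reading, FT - TF divided by the number of test items equals the change in accuracy between the two models, so the metric expresses the same gain as a count of items rather than as a percentage.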