Process-Supervised Multi-Agent Reinforcement Learning for Reliable Clinical Reasoning
Chaeeun Lee, T. Michael Yates, Pasquale Minervini, T. Ian Simpson
TL;DR
This work proposes a hierarchical agent-as-tool reinforcement learning framework for gene–disease validity curation that enforces process-grounded reasoning aligned with the ClinGen standard operating procedure (SOP). A supervisor orchestrates six specialised sub-agents and is trained with a hybrid reward that combines outcome accuracy and process fidelity, optimised via Group Relative Policy Optimisation (GRPO). On ClinGen-derived data, adding the process reward raises process alignment from 0.392 to 0.520 F1 relative to outcome-only training while keeping final accuracy high (0.750), and case studies show richer, auditable evidence traces. The methodology advances trustworthy clinical AI by delivering structured, category-specific reasoning alongside final decisions, and may be applicable to other SOP-governed biomedical tasks.
Abstract
Clinical decision-making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi-agent systems (MAS) show promise, they largely optimise for outcome accuracy and overlook process-grounded reasoning aligned with clinical standards. One critical real-world example is gene-disease validity curation, where experts must determine whether a gene is causally implicated in a disease by synthesising diverse biomedical evidence. We introduce an agent-as-tool reinforcement learning framework for this task with two objectives: (i) process-level supervision to ensure reasoning follows valid clinical pathways, and (ii) efficient coordination via a hierarchical multi-agent system. Our evaluation on the ClinGen dataset shows that, with outcome-only rewards, a MAS with a GRPO-trained Qwen3-4B supervisor agent substantially improves final outcome accuracy from 0.195 (base-model supervisor) to 0.732, but yields poor process alignment (0.392 F1). In contrast, with process + outcome rewards, a MAS with a GRPO-trained supervisor achieves higher outcome accuracy (0.750) while significantly improving process fidelity to 0.520 F1. Our code is available at https://github.com/chaeeunlee-io/GeneDiseaseCurationAgents.
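
To make the two reward settings concrete, the following is a minimal sketch of how a process + outcome reward and GRPO-style group-relative advantages could be computed. It is not the authors' implementation: the function names, the set-based F1 over evidence categories, the `process_weight` mixing parameter, and the example labels are all assumptions introduced for illustration. Only the general idea (an outcome term plus a process term, with rewards normalised within a group of sampled rollouts) is taken from the abstract and from the standard GRPO formulation.

```python
from statistics import mean, pstdev


def process_f1(predicted, reference):
    """Set-based F1 between predicted and reference evidence categories.

    Assumption: process fidelity is scored by category overlap; the paper
    may define it differently.
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # nothing expected, nothing produced
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def hybrid_reward(pred_label, gold_label, pred_categories, gold_categories,
                  process_weight=0.5):
    """Weighted sum of outcome correctness and process F1.

    `process_weight` is a hypothetical mixing coefficient; setting it to 0.0
    recovers an outcome-only reward.
    """
    outcome = 1.0 if pred_label == gold_label else 0.0
    process = process_f1(pred_categories, gold_categories)
    return (1.0 - process_weight) * outcome + process_weight * process


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardise rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of three rollouts for the same gene-disease pair.
    gold_label = "Definitive"
    gold_categories = ["genetic_evidence", "experimental_evidence"]
    rollouts = [
        ("Definitive", ["genetic_evidence", "experimental_evidence"]),
        ("Definitive", ["genetic_evidence"]),
        ("Limited", []),
    ]
    rewards = [
        hybrid_reward(label, gold_label, cats, gold_categories)
        for label, cats in rollouts
    ]
    print(rewards)
    print(group_relative_advantages(rewards))
```

In this sketch, a rollout that reaches the right classification with incomplete evidence earns less than one that also covers the reference categories, which is the behaviour the process term is meant to encourage.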
