
Process-Supervised Multi-Agent Reinforcement Learning for Reliable Clinical Reasoning

Chaeeun Lee, T. Michael Yates, Pasquale Minervini, T. Ian Simpson

TL;DR

This work proposes a hierarchical Agent-as-Tool reinforcement learning framework for gene–disease validity curation that enforces process-grounded reasoning aligned with the ClinGen standard operating procedure (SOP). A supervisor orchestrates six specialised sub-agents and is trained with Group Relative Policy Optimisation (GRPO) on a hybrid reward combining outcome accuracy and process fidelity. On ClinGen-derived data, the hybrid reward raises process alignment from 0.392 to 0.520 F1 relative to outcome-only training while keeping outcome accuracy high (0.750 versus 0.732), and case studies show richer, auditable evidence traces. By delivering structured, category-specific reasoning alongside final decisions, the approach supports trustworthy clinical AI and may extend to other SOP-governed biomedical tasks.

Abstract

Clinical decision-making requires nuanced reasoning over heterogeneous evidence and traceable justifications. While recent LLM multi-agent systems (MAS) show promise, they largely optimise for outcome accuracy while overlooking process-grounded reasoning aligned with clinical standards. One critical real-world example is gene–disease validity curation, where experts must determine whether a gene is causally implicated in a disease by synthesising diverse biomedical evidence. We introduce an agent-as-tool reinforcement learning framework for this task with two objectives: (i) process-level supervision to ensure reasoning follows valid clinical pathways, and (ii) efficient coordination via a hierarchical multi-agent system. Our evaluation on the ClinGen dataset shows that, with outcome-only rewards, a MAS with a GRPO-trained Qwen3-4B supervisor agent substantially improves final outcome accuracy, from 0.195 with a base model supervisor to 0.732, but yields poor process alignment (0.392 F1). In contrast, with process + outcome rewards, the MAS with a GRPO-trained supervisor achieves higher outcome accuracy (0.750) while significantly improving process fidelity to 0.520 F1. Our code is available at https://github.com/chaeeunlee-io/GeneDiseaseCurationAgents.
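
To make the "process + outcome" reward and the GRPO training signal concrete, the following is a minimal sketch, not the authors' implementation. It assumes the process term is an F1 overlap between the evidence categories the supervisor invoked and the gold categories, and that the two terms are mixed with a hypothetical `process_weight`; the paper's exact reward definition and weighting may differ. The function names and the category and class labels in the usage note are illustrative only. The group-relative advantage (each rollout's reward standardised against the other rollouts sampled for the same prompt) is the standard GRPO formulation.

```python
from statistics import mean, pstdev


def process_f1(predicted: set[str], gold: set[str]) -> float:
    """F1 overlap between invoked evidence categories and gold categories.

    Assumption: this stands in for the paper's process-fidelity term.
    """
    if not predicted and not gold:
        return 1.0  # nothing expected, nothing invoked
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def hybrid_reward(
    pred_label: str,
    gold_label: str,
    pred_categories: set[str],
    gold_categories: set[str],
    process_weight: float = 0.5,  # hypothetical mixing weight
) -> float:
    """Weighted sum of outcome correctness (0 or 1) and process F1."""
    outcome = 1.0 if pred_label == gold_label else 0.0
    process = process_f1(pred_categories, gold_categories)
    return (1.0 - process_weight) * outcome + process_weight * process


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardise rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, a rollout that predicts the correct class but invokes one unnecessary category (`{"case_level", "segregation"}` against gold `{"case_level"}`) scores lower under this sketch than a rollout that is correct on both counts, so the group-relative advantage favours the second rollout even though both would tie under an outcome-only reward.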

Paper Structure

This paper contains 33 sections, 15 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Overview of the hierarchical Agent-as-Tool framework for the gene–disease validity curation task. The supervisor orchestrates specialised sub-agents to produce both a validity classification and a structured evidence trace matching a clinical SOP.
  • Figure 2: Training trajectories for single-agent and multi-agent settings. In the single-agent setting (left), a single model optionally retrieves full text and directly predicts evidence subtypes and the final validity class. Process reward is applied at the level of evidence subtypes. In the multi-agent setting (right), a supervisor model invokes specialised sub-agents through tool calls and aggregates their outputs to construct a subtype-level evidence profile and predict the final validity class. Process reward is applied at the level of the supervisor's agent calls.
  • Figure 3: Supervisor reward trajectories. Reward progression over training steps for (a) outcome-only total reward, (b) hybrid reward, (c) process component of the hybrid reward, and (d) outcome component of the hybrid reward. Longer convergence trajectories and single-agent trajectories are provided in the Appendix.
  • Figure 4: Outcome accuracy vs. evidence accuracy for single-agent and multi-agent configurations across model sizes and training settings. The plot illustrates how GRPO training with hybrid reward shifts performance along both axes.
  • Figure 5: Sub-agent performance. Stacked bars show TP, TN, FN, and FP counts for each evidence agent called by the supervisor trained with (a) outcome-only and (b) hybrid rewards. TP, FP, and FN indicate correct or incorrect subtype predictions; TN indicates correctly predicting the absence of evidence for a given category.
  • ...and 3 more figures
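
The TP/FP/FN/TN counts described for Figure 5 can be turned into precision, recall, and F1 per evidence agent. The sketch below is illustrative only: the `Counts` container, the agent names, and the micro-averaged aggregate are assumptions, and the paper may aggregate its process F1 differently. Note that TN does not enter precision, recall, or F1, so an agent that mostly (and correctly) reports no evidence is not rewarded for that by F1 alone.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion counts for one evidence agent (hypothetical container)."""
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0  # kept for completeness; unused by precision/recall/F1


def precision_recall_f1(c: Counts) -> tuple[float, float, float]:
    """Precision, recall, and F1 from confusion counts, with zero-safe division."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def micro_f1(per_agent: dict[str, Counts]) -> float:
    """Pool counts over all agents, then compute a single F1 (micro-average)."""
    pooled = Counts(
        tp=sum(c.tp for c in per_agent.values()),
        fp=sum(c.fp for c in per_agent.values()),
        fn=sum(c.fn for c in per_agent.values()),
        tn=sum(c.tn for c in per_agent.values()),
    )
    return precision_recall_f1(pooled)[2]
```

With made-up counts such as `Counts(tp=6, fp=2, fn=2, tn=10)` for one agent and `Counts(tp=2, fp=2, fn=6)` for another, the per-agent F1 values are 0.75 and 1/3, and the pooled micro-F1 is 16/28.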