RiskAgent: Autonomous Medical AI Copilot for Generalist Risk Prediction

Fenglin Liu, Jinge Wu, Hongjian Zhou, Xiao Gu, Soheila Molaei, Anshul Thakur, Lei Clifton, Honghan Wu, David A. Clifton

TL;DR

RiskAgent addresses generalist medical risk prediction in real-world clinical settings by orchestrating collaboration between LLMs and hundreds of evidence-based clinical tools. Its Decider-Executor-Reviewer framework, together with an Environment component, handles tool selection, parameter parsing, tool execution, and justification within a feedback loop, which improves reliability and interpretability. On the MedRisk benchmark (12,352 questions across 154 diseases, 86 symptoms, 50 specialties, and 24 organ systems), an 8B-parameter RiskAgent achieves 78.34% accuracy on MedRisk-Qualitative and 76.33% on MedRisk-Quantitative, a statistically significant improvement over baselines (p < 0.01), and it generalizes to external benchmarks such as MEDCALC-BENCH. The approach is open-sourced as a model family ranging from 1B to 70B parameters, supporting privacy-friendly deployment and broader adoption in resource-constrained clinical settings while delivering transparent, evidence-based outputs.
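
The Decider-Executor-Reviewer loop described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the real agents are LLMs, whereas here each role is a plain function, and every name (`Tool`, `TOOLS`, `decider`, `executor`, `reviewer`, `risk_agent`) as well as both toy scoring rules are hypothetical and invented for illustration. The toy tools are not real clinical calculators.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass(frozen=True)
class Tool:
    """A hypothetical stand-in for one evidence-based calculator."""
    name: str
    keywords: Tuple[str, ...]   # words that suggest this tool is relevant
    required: Tuple[str, ...]   # parameters the tool needs
    run: Callable[[Dict[str, float]], float]


# Toy tools with made-up scoring rules; NOT real clinical calculators.
TOOLS = [
    Tool("toy_bleeding_score", ("bleeding",), ("age", "on_anticoagulant"),
         lambda p: (p["age"] >= 65) + p["on_anticoagulant"]),
    Tool("toy_stroke_score", ("stroke",), ("age", "hypertension", "diabetes"),
         lambda p: 2 * (p["age"] >= 75) + p["hypertension"] + p["diabetes"]),
]


def decider(question: str, rejected: Set[str]) -> Optional[Tool]:
    """Tool selection: first non-rejected tool whose keyword is in the question."""
    text = question.lower()
    for tool in TOOLS:
        if tool.name not in rejected and any(k in text for k in tool.keywords):
            return tool
    return None


def executor(tool: Tool, record: Dict[str, float]) -> Tuple[Optional[float], str]:
    """Parameter parsing and tool execution against the patient record."""
    missing = [k for k in tool.required if k not in record]
    if missing:
        return None, f"missing parameters: {missing}"
    params = {k: record[k] for k in tool.required}
    return tool.run(params), "ok"


def reviewer(score: Optional[float]) -> bool:
    """Accept the result only if the tool actually produced a score."""
    return score is not None


def risk_agent(question: str, record: Dict[str, float], max_rounds: int = 3) -> dict:
    """Run the loop; a rejected tool is fed back so the decider tries another."""
    rejected: Set[str] = set()
    for _ in range(max_rounds):
        tool = decider(question, rejected)
        if tool is None:
            break
        score, _note = executor(tool, record)
        if reviewer(score):
            return {"tool": tool.name, "score": score}
        rejected.add(tool.name)  # feedback to the next decider call
    return {"tool": None, "score": None}


# Made-up patient record; the bleeding tool fails first (missing parameter).
result = risk_agent("Assess bleeding and stroke risk",
                    {"age": 78, "hypertension": 1, "diabetes": 0})
print(result)  # {'tool': 'toy_stroke_score', 'score': 3}
```

In this sketch the feedback loop is reduced to a set of rejected tool names. The first candidate is rejected because the record lacks a required parameter, so the second round selects a different tool.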

Abstract

The use of Large Language Models (LLMs) in clinical applications has attracted growing research attention. However, real-world clinical decision-making differs significantly from the standardized, exam-style scenarios commonly used in current efforts. In this paper, we present the RiskAgent system to perform a broad range of medical risk predictions, covering over 387 risk scenarios across diverse complex diseases, e.g., cardiovascular disease and cancer. RiskAgent is designed to collaborate with hundreds of clinical decision tools, i.e., risk calculators and scoring systems that are supported by evidence-based medicine. To evaluate our method, we have built MedRisk, the first benchmark specialized for risk prediction, including 12,352 questions spanning 154 diseases, 86 symptoms, 50 specialties, and 24 organ systems. The results show that our RiskAgent, with 8 billion model parameters, achieves 76.33% accuracy, outperforming the most recent commercial LLMs, o1, o3-mini, and GPT-4.5, and nearly doubling the 38.39% accuracy of GPT-4o. On rare diseases, e.g., Idiopathic Pulmonary Fibrosis (IPF), RiskAgent outperforms o1 and GPT-4.5 by 27.27% and 45.46% accuracy, respectively. Finally, we conduct a generalization evaluation on an external evidence-based diagnosis benchmark and show that our RiskAgent achieves the best results. These encouraging results demonstrate the great potential of our solution for diverse diagnosis domains. To improve the adaptability of our model in different scenarios, we have built and open-sourced a family of models ranging from 1 billion to 70 billion parameters. Our code, data, and models are all available at https://github.com/AI-in-Health/RiskAgent.

Paper Structure

This paper contains 13 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: a. The RiskAgent, which includes three LLM agents (Decider, Executor, and Reviewer), can perform multiple medical risk predictions given the patient's healthcare information. b. Statistics of the MedRisk benchmark, which consists of 12,352 medical risk questions, covering 154 diseases, 86 symptoms, 50 specialties, and 24 organ systems. c. With only 8 billion parameters, our RiskAgent outperforms both existing high-performance general LLMs (i.e., GPT-4o [hurst2024gpt4o], o1 [jaech2024openaio1], o3-mini [gpt2025o3]) and a state-of-the-art medical LLM (i.e., Meditron-70B) by large margins across different diseases, symptoms, specialties, and organ systems. In contrast to existing LLMs, RiskAgent can collaborate with evidence-based medical tools to not only substantially increase its risk prediction accuracy, but also deliver evidence-based answers. The t-tests between the results from RiskAgent and the best-performing LLMs indicate that the improvement is significant with $p < 0.01$. d. Examples of risk predictions by our method for cancer, cardiac events, and asthma demonstrate greater accuracy than GPT-4o. The pink- and blue-colored text indicates the incorrect and desirable answers, respectively.
  • Figure 2: Flowchart of the RiskAgent system. a. Data flow in the system. b. Demonstration of the Environment component in the system.
  • Figure 3: Performance for RiskAgent-8B, GPT-4o, and o3-mini. In the boxplot, the central line indicates the median value, while the lower and upper boundaries represent the 25th (Q1) and 75th (Q3) percentiles, respectively. The whiskers extend up to 1.5 times the interquartile range (IQR).
  • Figure 4: The robustness of our method: We evaluate the performance of models on five rare diseases (left) and six types of cancer (right). SM: Systemic Mastocytosis; PM: Primary Myelofibrosis; CML: Chronic Myelogenous Leukemia; IPF: Idiopathic Pulmonary Fibrosis.
  • Figure 5: The generalization ability of our method: We report the overall accuracy of the basic LLMs (blue bars) and the basic LLMs enhanced using our method (red bars). We evaluate a. different variants of the GPT-4o-series LLMs developed at different times and b. LLaMA-3-series LLMs with varying numbers of model parameters. The polyline and the right y-axis show the improvements in different variants. We can see that the more advanced (a) and the larger (b) the basic LLM, the greater the improvements.
  • ...and 1 more figures
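
The boxplot convention in the Figure 3 caption (median, Q1, Q3, and whiskers reaching up to 1.5 times the IQR) can be made concrete with a short Python sketch. The function name `boxplot_stats` and the sample values are hypothetical and are not data from the paper. The sketch assumes the common convention that each whisker ends at the most extreme data point still within 1.5 IQR of the box, and it uses the inclusive quantile method from the standard library; plotting libraries may interpolate quartiles slightly differently.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Return the quartiles and whisker ends for a boxplot of `values`."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme points inside the fences;
    # anything beyond a fence would be drawn as an outlier.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
    }


# Made-up sample values, chosen so that 100 falls outside the upper fence.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```

With these sample values the upper whisker ends at 8, and the value 100 lies beyond the upper fence, so it would be plotted as an outlier rather than stretching the whisker.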