
Advancing Healthcare Automation: Multi-Agent System for Medical Necessity Justification

Himanshu Pandey, Akhil Amod, Shivang

TL;DR

This work tackles the prior authorization (PA) bottleneck in healthcare by introducing a multi-agent system of specialized LLMs that splits the medical-necessity determination into leaf-level checks and bottom-up aggregation. The leaf stage uses retrieval-augmented evidence selection, evidence classification, and jury-style voting to output item judgments with explanations, while a Propagator Agent fuses child judgments at each parent node according to the item's logical operators to yield a final root decision. On MIMIC-III-based data and clinical guidelines, GPT-4 is the strongest performer, achieving approximately $86.2\%$ leaf-item accuracy and $95.6\%$ root-checklist accuracy, while chain-of-thought (CoT) prompting and in-context learning (ICL) improve performance for smaller models. The study also emphasizes explainability through evidence-backed rationales and chain-of-thought prompts, and outlines a scalable, microservice-based path toward deployable PA automation and broader clinical decision-support applications.
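The two stages summarized above can be illustrated with a small sketch. It is not the paper's implementation: in the paper both stages are carried out by LLM agents, whereas the code below substitutes a deterministic majority vote for the jury and three-valued AND/OR rules for the Propagator Agent. The names `Node`, `jury_vote`, `propagate`, and `example_checklist`, the tie-breaking rule, and the placeholder checklist items are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional

# Three possible judgments for a checklist item.
TRUE, FALSE, NO_INFO = "True", "False", "No Information"


def jury_vote(votes: List[str]) -> str:
    """Majority vote over juror labels.

    Assumption: an empty jury or a tie for first place yields NO_INFO.
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return NO_INFO
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return NO_INFO
    return ranked[0][0]


@dataclass
class Node:
    """A checklist item: a leaf carries juror votes, a parent carries an operator."""
    text: str
    operator: Optional[str] = None  # "AND" or "OR" for parents, None for leaves
    children: List["Node"] = field(default_factory=list)
    votes: List[str] = field(default_factory=list)


def propagate(node: Node) -> str:
    """Bottom-up aggregation: resolve the leaves, then fuse at each parent."""
    if not node.children:
        return jury_vote(node.votes)
    results = [propagate(child) for child in node.children]
    if node.operator == "AND":
        # One failed child fails the parent; otherwise unknowns dominate.
        if FALSE in results:
            return FALSE
        return NO_INFO if NO_INFO in results else TRUE
    if node.operator == "OR":
        # One satisfied child satisfies the parent; otherwise unknowns dominate.
        if TRUE in results:
            return TRUE
        return NO_INFO if NO_INFO in results else FALSE
    raise ValueError(f"unsupported operator: {node.operator!r}")


def example_checklist() -> Node:
    """Hypothetical checklist: item A AND (item B OR item C)."""
    item_a = Node("item A", votes=[TRUE, TRUE, FALSE])
    item_b = Node("item B", votes=[FALSE, FALSE, TRUE])
    item_c = Node("item C", votes=[TRUE, NO_INFO, TRUE])
    either = Node("item B or item C", operator="OR", children=[item_b, item_c])
    return Node("root", operator="AND", children=[item_a, either])
```

Calling `propagate(example_checklist())` resolves the three leaves by vote, fuses items B and C under OR, and then fuses that result with item A under AND. Treating "No Information" as an unknown that only a decisive sibling can override is one reasonable choice; the paper's Propagator Agent may handle such cases differently.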

Abstract

Prior Authorization helps ensure safe, appropriate, and cost-effective care that is medically justified by evidence-based guidelines. However, the process often requires labor-intensive manual comparison of patient medical records against clinical guidelines, which is both repetitive and time-consuming. Recent developments in Large Language Models (LLMs) have shown potential in addressing complex medical NLP tasks with minimal supervision. This paper explores the application of a Multi-Agent System (MAS) that uses specialized LLM agents to automate the Prior Authorization task by breaking it down into simpler, more manageable sub-tasks. Our study systematically investigates the effects of various prompting strategies on these agents and benchmarks the performance of different LLMs. We demonstrate that GPT-4 achieves an accuracy of 86.2% in predicting checklist item-level judgments with evidence, and 95.6% in determining the overall checklist judgment. Additionally, we explore how these agents can contribute to the explainability of the steps taken in the process, thereby enhancing trust and transparency in the system.

Paper Structure

This paper contains 20 sections, 6 equations, 11 figures, and 2 tables.

Figures (11)

  • Figure 1: Leaf-Level Judgement Prediction where the first agent classifies the documents into supporting and contradictory sets and then the jury agent determines if the checklist item is satisfied.
  • Figure 2: Bottom-Up Judgement Propagation where the agent uses the logical operators contained in a checklist item to determine how the aggregation should take place.
  • Figure 3: An example checklist formatted as a decision tree
  • Figure 4: Annotation Dashboard where each annotator marks whether the checklist item is True, False, or No Information (cannot be concluded) and marks the evidence supporting their selection.
  • Figure 5: Recall of the encoder model (MiniLM-L6-v2) for various values of k
  • ...and 6 more figures