Grounded by Experience: Generative Healthcare Prediction Augmented with Hierarchical Agentic Retrieval

Chuang Zhao, Hui Tang, Hongke Zhao, Xiaofang Zhou, Xiaomeng Li

TL;DR

GHAR addresses hallucination and the timing of retrieval activation in healthcare prediction with a hierarchical agentic RAG framework in which two agents iteratively decide when to retrieve and how to retrieve content. The two agents are unified within a Markov Decision Process and optimized with multi-agent reinforcement learning, guided by a diverse reward structure that aligns reasoning efficiency, retrieval relevance, and final accuracy. Using meta-path partitions to constrain GraphRAG, GHAR reports superior performance on three prediction tasks (DEC Pred, READ Pred, LOS Pred) across three datasets (MIMIC-III, MIMIC-IV, and eICU), supported by ablations, OOD evaluations, and semantic QA demonstrations. By enabling dynamic, explainable, and scalable augmentation of LLM predictions with targeted external knowledge, the work could reduce hallucinations and improve generalization in clinical decision support.

Abstract

Accurate healthcare prediction is critical for improving patient outcomes and reducing operational costs. Bolstered by growing reasoning capabilities, large language models (LLMs) offer a promising path to enhance healthcare predictions by drawing on their rich parametric knowledge. However, LLMs are prone to factual inaccuracies due to limitations in the reliability and coverage of their embedded knowledge. While retrieval-augmented generation (RAG) frameworks, such as GraphRAG and its variants, have been proposed to mitigate these issues by incorporating external knowledge, they face two key challenges in the healthcare scenario: (1) identifying when retrieval is clinically necessary, and (2) achieving synergy between the retriever and the generator so that retrievals are contextually appropriate. To address these challenges, we propose GHAR, a Generative Hierarchical Agentic RAG framework that simultaneously resolves when to retrieve and how to optimize the collaboration between submodules in healthcare. Specifically, for the first challenge, we design a dual-agent architecture comprising Agent-Top and Agent-Low. Agent-Top acts as the primary physician, iteratively deciding whether to rely on parametric knowledge or to initiate retrieval, while Agent-Low acts as the consulting service, summarizing all task-relevant knowledge once retrieval is triggered. To tackle the second challenge, we unify the optimization of both agents within a formal Markov Decision Process, designing diverse rewards to align their shared goal of accurate prediction while preserving their distinct roles. Extensive experiments on three benchmark datasets across three popular tasks demonstrate that GHAR outperforms state-of-the-art baselines, highlighting the potential of hierarchical agentic RAG in advancing healthcare systems.
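
The control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the action labels, the `run_episode` function, the step budget, and the toy agents and retriever are all hypothetical stand-ins chosen for illustration. The only parts taken from the text are that Agent-Top iteratively chooses between parametric reasoning and retrieval, and that Agent-Low summarizes the retrieved knowledge once retrieval is triggered.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical action labels for Agent-Top (not taken from the paper).
REASON, RETRIEVE, ANSWER = "reason", "retrieve", "answer"


@dataclass
class State:
    """Agent state: the initial query plus all historical reasoning steps."""
    query: str
    history: List[str] = field(default_factory=list)


def run_episode(query, agent_top, agent_low, retrieve, max_steps=4):
    """Iterate until Agent-Top answers or the (assumed) step budget runs out."""
    state = State(query)
    for _ in range(max_steps):
        action, text = agent_top(state)
        if action == ANSWER:
            return text, state
        if action == RETRIEVE:
            # Agent-Low condenses the raw retrieved items into a summary.
            evidence = retrieve(text)
            summary = agent_low(state, evidence)
            state.history.append(f"[retrieved] {summary}")
        else:
            # Agent-Top relied on its own parametric knowledge for this step.
            state.history.append(f"[reasoned] {text}")
    return None, state  # no answer within the budget


# Toy stand-ins for the two agents and the retriever (placeholders only).
def toy_top(state):
    if not state.history:
        return RETRIEVE, "placeholder retrieval query"
    return ANSWER, "1"


def toy_low(state, evidence):
    return "; ".join(evidence[:2])


def toy_retrieve(text):
    return ["fact-1", "fact-2", "fact-3"]


answer, final_state = run_episode("placeholder patient query", toy_top, toy_low, toy_retrieve)
print(answer, final_state.history)
```

In a real system the two callables would be LLM policies and `retrieve` would query the knowledge graph; keeping them as plain functions here only makes the when-to-retrieve decision explicit.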

Paper Structure

This paper contains 23 sections, 18 equations, 16 figures, 7 tables, 1 algorithm.

Figures (16)

  • Figure 1: Motivation Difference. (a) Forecasting using only the LLM's parametric knowledge. (b) Single-round retrieval-augmented generation with an external knowledge graph (KG). (c) Our idea, which uses hierarchical agents for iterative generation.
  • Figure 2: Overview of GHAR. (a) Outline of the pipeline for each iteration, potentially involving LLM or LLM+RAG paths. (b) The agent's state is determined by the initial query and all historical reasoning paths. (c) Agent-Top must determine both whether to terminate the process and when to trigger retrieval. (d) Agent-Low summarizes the extracted external knowledge to produce a task-relevant response. (e) The diverse reward design includes cost reduction, format standardization, and accuracy, as well as ranking, to maintain the role distinction and collaborative dynamics between the two agents. ORM denotes the outcome-supervised reward model [deepseek].
  • Figure 3: Comparison under Diverse Retrievers. We employ the popular E5 [e5], BGE-M3 [bge3], and Clinical-BERT [wang2023optimized].
  • Figure 4: Comparison under Diverse LLMs. We employ Qwen2.5-3B with rank 8 (our method), Qwen2.5-3B with rank 16 [qwen2.5], and BioMistral-7B [biomistral].
  • Figure 5: Diverse Training Settings. Please note that in KARE, DEC Pred denotes in-hospital mortality, not within 24h.
  • ...and 11 more figures
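
The reward design named in the Figure 2 caption (cost reduction, format standardization, accuracy, ranking) can be pictured as a weighted combination of per-episode signals. The sketch below is only an assumption about how such signals might be combined. The weights, the assignment of components to each agent, and the `combine_rewards` helper are hypothetical and are not taken from the paper.

```python
from typing import Dict


def combine_rewards(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum over the reward components an agent is assumed to receive.

    Components missing from `weights` are ignored, which lets each agent
    see a different subset of the same episode-level signals.
    """
    return sum(weights[name] * components[name] for name in weights)


# Episode-level signals (illustrative values only).
components = {
    "accuracy": 1.0,   # 1.0 if the final prediction is correct, else 0.0
    "format": 1.0,     # 1.0 if the output follows the required format
    "cost": -2.0,      # negative count of retrieval calls (assumed cost proxy)
    "ranking": 0.5,    # relevance score of the summarized evidence
}

# Hypothetical per-agent weights: both share accuracy, each keeps a role-specific term.
top_weights = {"accuracy": 1.0, "format": 0.25, "cost": 0.125}
low_weights = {"accuracy": 1.0, "format": 0.25, "ranking": 0.5}

top_reward = combine_rewards(components, top_weights)
low_reward = combine_rewards(components, low_weights)
print(top_reward, low_reward)
```

Sharing the accuracy term while giving each agent its own role-specific term is one simple way to express a shared goal with distinct roles, as the abstract describes. The paper's actual reward formulas may differ.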