EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records

Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, May D. Wang

TL;DR

EHRAgent tackles few-shot multi-tabular reasoning over electronic health records by coupling an LLM agent with a code interface, so that the agent autonomously generates and executes data queries across EHR tables. It introduces four components (Medical Information Integration, Demonstration Optimization through Long-Term Memory, Interactive Coding with Execution, and Rubber Duck Debugging via Error Tracing) that together enable iterative plan refinement through environment feedback. Across three real-world EHR datasets (MIMIC-III, eICU, TREQS), EHRAgent outperforms the strongest baseline by up to 29.6% in success rate on multi-hop reasoning, while needing only a few demonstrations rather than large annotated training sets. The work highlights potential improvements to clinician workflows and acknowledges limitations such as extra execution costs and privacy considerations, suggesting future directions like white-box LLMs and cost-efficient strategies.

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities in planning and tool utilization as autonomous agents, but few have been developed for medical problem-solving. We propose EHRAgent, an LLM agent empowered with a code interface, to autonomously generate and execute code for multi-tabular reasoning within electronic health records (EHRs). First, we formulate an EHR question-answering task into a tool-use planning process, efficiently decomposing a complicated task into a sequence of manageable actions. By integrating interactive coding and execution feedback, EHRAgent learns from error messages and improves the originally generated code through iterations. Furthermore, we enhance the LLM agent by incorporating long-term memory, which allows EHRAgent to effectively select and build upon the most relevant successful cases from past experiences. Experiments on three real-world multi-tabular EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate. EHRAgent leverages the emerging few-shot learning capabilities of LLMs, enabling autonomous code generation and execution to tackle complex clinical tasks with minimal demonstrations.
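
The interactive coding loop described in the abstract (generate code, execute it, feed the error message back, and regenerate) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `execute`, `solve`, `toy_generator`, and `max_rounds` are hypothetical, the convention that generated code stores its result in a variable called `answer` is an assumption, and a hard-coded stub stands in for the LLM call. The sketch also omits long-term memory and runs code without any sandboxing.

```python
import traceback
from typing import Callable, Optional, Tuple

# Hypothetical signature for a code generator: (question, previous_code, feedback) -> code
Generator = Callable[[str, Optional[str], Optional[str]], str]


def execute(code: str) -> Tuple[bool, str]:
    """Run generated code and return (ok, answer-or-error-message)."""
    scope: dict = {}
    try:
        exec(code, scope)  # unsandboxed: acceptable for a toy demo only
    except Exception:
        # Keep only the final traceback line, which names the exception.
        return False, traceback.format_exc().strip().splitlines()[-1]
    if "answer" not in scope:
        return False, "RuntimeError: code did not define `answer`"
    return True, str(scope["answer"])


def solve(question: str, generate: Generator, max_rounds: int = 3) -> Optional[str]:
    """Regenerate code until it executes cleanly or the round budget runs out."""
    code: Optional[str] = None
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        code = generate(question, code, feedback)
        ok, result = execute(code)
        if ok:
            return result
        feedback = result  # the error message guides the next attempt
    return None


def toy_generator(question: str, previous_code: Optional[str], feedback: Optional[str]) -> str:
    """Stub standing in for an LLM: the first attempt is buggy, the retry is fixed."""
    if feedback is None:
        return "answer = sum(stays) / len(stays)"  # bug: `stays` is undefined
    return "stays = [2, 4, 6]\nanswer = sum(stays) / len(stays)"


print(solve("What is the mean length of stay?", toy_generator))  # prints 4.0
```

In this toy run the first attempt raises a `NameError`, the error line is passed back as feedback, and the second attempt succeeds. A real agent would replace `toy_generator` with a model call whose prompt includes the question, the previous code, and the error message.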

Paper Structure

This paper contains 58 sections, 3 equations, 14 figures, 8 tables, and 1 algorithm.

Figures (14)

  • Figure 1: Simple and efficient interactions between clinicians and EHR systems with the assistance of LLM agents. Clinicians specify tasks in natural language, and the LLM agent autonomously generates and executes code to interact with EHRs (right) for answers. It eliminates the need for specialized expertise or extra effort from data engineers, which is typically required when dealing with EHRs in existing clinical settings (left).
  • Figure 2: Compared to general domain tasks (blue) such as WikiSQL (Zhong et al., 2017) and SPIDER (Yu et al., 2018), multi-tabular reasoning tasks within EHRs (orange) typically involve a significantly larger number of records per table and necessitate querying multiple tables to answer each question, thereby requiring more advanced reasoning and problem-solving capabilities.
  • Figure 3: Overview of our proposed LLM agent, EHRAgent, for complex few-shot tabular reasoning tasks on EHRs. Given an input clinical question based on EHRs, EHRAgent decomposes the task and generates a plan (i.e., code) based on (a) metadata (i.e., descriptions of tables and columns in EHRs), (b) tool function definitions, (c) few-shot examples, and (d) domain knowledge (i.e., integrated medical information). Upon execution, EHRAgent iteratively debugs the generated code following the execution errors and ultimately generates the final solution.
  • Figure 4: Success rate and completion rate under different question complexity, measured by the number of elements (i.e., slots) in each question (upper) and the number of columns involved in each solution (bottom).
  • Figure 5: Success rate and completion rate under different numbers of demonstrations.
  • ...and 9 more figures