EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records
Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce Ho, Carl Yang, May D. Wang
TL;DR
EHRAgent tackles few-shot multi-tabular reasoning over electronic health records (EHRs) by coupling an LLM agent with a code interface, so that the agent can autonomously generate and execute data queries across EHR tables. It introduces four components: Medical Information Integration, Demonstration Optimization through Long-Term Memory, Interactive Coding with Execution, and Rubber Duck Debugging via Error Tracing. Together, these let the agent refine its plan iteratively using feedback from the execution environment. On three real-world EHR datasets (MIMIC-III, eICU, TREQS), EHRAgent outperforms the strongest baseline by up to 29.6% in success rate on multi-hop reasoning, while needing only a few demonstrations rather than a large annotated training set. The work highlights potential improvements to clinician workflows, acknowledges limitations such as extra execution cost and privacy considerations, and suggests future directions such as white-box LLMs and more cost-efficient strategies.
Abstract
Large language models (LLMs) have demonstrated exceptional capabilities in planning and tool utilization as autonomous agents, but few have been developed for medical problem-solving. We propose EHRAgent, an LLM agent empowered with a code interface, to autonomously generate and execute code for multi-tabular reasoning within electronic health records (EHRs). First, we formulate an EHR question-answering task into a tool-use planning process, efficiently decomposing a complicated task into a sequence of manageable actions. By integrating interactive coding and execution feedback, EHRAgent learns from error messages and improves the originally generated code through iterations. Furthermore, we enhance the LLM agent by incorporating long-term memory, which allows EHRAgent to effectively select and build upon the most relevant successful cases from past experiences. Experiments on three real-world multi-tabular EHR datasets show that EHRAgent outperforms the strongest baseline by up to 29.6% in success rate. EHRAgent leverages the emerging few-shot learning capabilities of LLMs, enabling autonomous code generation and execution to tackle complex clinical tasks with minimal demonstrations.
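The interactive coding loop described above can be illustrated with a minimal sketch. This is not EHRAgent's implementation: the names `run_candidate`, `solve`, `stub_generate`, and `TOY_ENV` are hypothetical, the "LLM" is replaced by a hard-coded stub that returns a buggy query first and a corrected one once it sees an error message, and the EHR table is a two-row toy list. It only shows the general shape of generating code, executing it, and feeding the error message back into the next attempt.

```python
from typing import Callable, Optional, Tuple

# Toy stand-in for an EHR table (hypothetical data, not from any real dataset).
TOY_ENV = {"patients": [{"id": 1, "age": 70}, {"id": 2, "age": 50}]}

# Two candidate programs the stub "LLM" can emit. The first uses a wrong key.
BUGGY = 'answer = sum(1 for p in patients if p["Age"] > 60)'
FIXED = 'answer = sum(1 for p in patients if p["age"] > 60)'


def run_candidate(code: str, env: dict) -> Tuple[bool, str]:
    """Execute candidate code and return (success, answer or error message)."""
    scope = dict(env)  # copy so a failed run cannot pollute the shared tables
    try:
        exec(code, scope)  # the generated code must assign a variable `answer`
        return True, str(scope["answer"])
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"


def stub_generate(question: str, feedback: Optional[str]) -> str:
    """Stand-in for an LLM call: revise the code only after seeing an error."""
    return BUGGY if feedback is None else FIXED


def solve(
    question: str,
    generate: Callable[[str, Optional[str]], str],
    env: dict,
    max_rounds: int = 3,
) -> Optional[str]:
    """Generate code, execute it, and retry with the error message as feedback."""
    feedback = None
    for _ in range(max_rounds):
        code = generate(question, feedback)
        ok, result = run_candidate(code, env)
        if ok:
            return result
        feedback = result  # the error text becomes context for the next attempt
    return None  # no attempt succeeded within the round budget


if __name__ == "__main__":
    print(solve("How many patients are older than 60?", stub_generate, TOY_ENV))
```

In this sketch the first attempt fails with a `KeyError`, the message is passed back to the generator, and the second attempt succeeds. A real agent would also need sandboxing around `exec`, since executing model-generated code directly is unsafe outside a toy setting.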
