EHRStruct: A Comprehensive Benchmark Framework for Evaluating Large Language Models on Structured Electronic Health Record Tasks

Xiao Yang, Xuejiao Zhao, Zhiqi Shen

TL;DR

EHRStruct presents a rigorous benchmark for evaluating LLMs on structured EHR tasks, addressing the lack of standardization in prior work by defining 11 clinically grounded tasks and assembling 2,200 samples from synthetic and real EHR data. The study systematically evaluates 20 LLMs and 11 enhancement methods, analyzes input formats, few-shot learning, and fine-tuning effects, and introduces EHRMaster, a code-augmented framework that achieves state-of-the-art results, particularly on Data-Driven tasks. Key findings show that general-purpose LLMs often surpass medical-domain models and that performance varies across task types, which highlights the need for improved clinical knowledge integration. The work provides actionable insights and a practical platform for advancing structured EHR reasoning, with potential impact on clinical decision support and AI benchmarking.

Abstract

Structured Electronic Health Record (EHR) data stores patient information in relational tables and plays a central role in clinical decision-making. Recent advances have explored the use of large language models (LLMs) to process such data, showing promise across various clinical tasks. However, the absence of standardized evaluation frameworks and clearly defined tasks makes it difficult to systematically assess and compare LLM performance on structured EHR data. To address these evaluation challenges, we introduce EHRStruct, a benchmark specifically designed to evaluate LLMs on structured EHR tasks. EHRStruct defines 11 representative tasks spanning diverse clinical needs and includes 2,200 task-specific evaluation samples derived from two widely used EHR datasets. We use EHRStruct to evaluate 20 advanced and representative LLMs, covering both general and medical models. We further analyze key factors influencing model performance, including input formats, few-shot generalisation, and finetuning strategies, and compare results with 11 state-of-the-art LLM-based enhancement methods for structured data reasoning. Our results indicate that many structured EHR tasks place high demands on the understanding and reasoning capabilities of LLMs. In response, we propose EHRMaster, a code-augmented method that achieves state-of-the-art performance and offers practical insights to guide future research.

Paper Structure

This paper contains 34 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of EHRStruct. The figure illustrates the four key components of the benchmark: (1) task synthesis through clinical needs induction and task distillation from prior research; (2) taxonomy construction based on clinical scenarios and reasoning levels; (3) task-specific sample extraction from real and synthetic EHR data; and (4) the model evaluation pipeline, including table input, format conversion, model inference, and answer evaluation.
  • Figure 2: Relative performance gains from different input formats across LLMs.
  • Figure 3: Performance of representative LLMs on two scenarios under few-shot (1-, 3-, and 5-shot) learning settings.
  • Figure 4: Finetuning results on all targeted categories. Single-task indicates separate finetuning on each task; multi-task indicates joint finetuning across all tasks.
  • Figure 5: Comparison of relative gains for 11 SOTA methods across tasks. Relative gain is defined as the percentage of improvement each method achieves toward the maximum possible gain for each task, where 0% indicates no improvement and 100% represents the upper bound. In each subfigure, the left side shows Data-Driven tasks, and the right side shows Knowledge-Driven tasks.
  • ...and 1 more figure
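
The relative gain described in the Figure 5 caption can be illustrated with a small calculation. The sketch below is one plausible reading of that definition, not the authors' code: it assumes the gain is measured from a baseline score toward a fixed upper bound, and the names `relative_gain`, `baseline`, `method`, and `upper_bound`, as well as the default upper bound of 100, are hypothetical choices made for illustration.

```python
def relative_gain(baseline: float, method: float, upper_bound: float = 100.0) -> float:
    """Percentage of the remaining headroom that a method recovers.

    Assumption (not stated in the source): the "maximum possible gain"
    is upper_bound - baseline, with scores on a 0-100 scale by default.
    """
    headroom = upper_bound - baseline
    if headroom <= 0:
        # No room to improve, so the ratio is undefined.
        raise ValueError("baseline must be strictly below upper_bound")
    # 0% when method == baseline, 100% when method == upper_bound.
    return 100.0 * (method - baseline) / headroom


if __name__ == "__main__":
    # Hypothetical scores, chosen only to show the arithmetic.
    print(relative_gain(baseline=60.0, method=80.0))
```

Under this reading, a method that lowers the score yields a negative value; whether the figure clips such cases at 0% is not stated in the caption.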