MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A. Jindal, Eduardo P. Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins, Ethan Steinberg, Ashwin Nayak, Birju S. Patel, Chia-Chun Chiang, Alison Callahan, Zepeng Huo, Sergios Gatidis, Scott J. Adams, Oluseyi Fayanju, Shreya J. Shah, Thomas Savage, Ethan Goh, Akshay S. Chaudhari, Nima Aghaeepour, Christopher Sharp, Michael A. Pfeffer, Percy Liang, Jonathan H. Chen, Keith E. Morse, Emma P. Brunskill, Jason A. Fries, Nigam H. Shah

TL;DR

MedAlign introduces a clinician-generated benchmark for instruction following on electronic health records (EHRs), pairing 983 clinician-written instructions with 276 longitudinal EHRs via XML markup and BM25-based matching. The dataset includes clinician-written gold responses for 303 instruction-EHR pairs and supports evaluation of six large language models, revealing error rates from 35% (GPT-4) to 68% (MPT-7B-Instruct) and an 8.3% drop in accuracy for GPT-4 when context length shrinks from 32k to 2k. The study also reports correlations between automated NLG metrics (notably COMET) and clinician rankings, offering a scalable proxy for human evaluation. MedAlign thus provides a realistic framework for benchmarking and improving long-context EHR tasks, while highlighting ethical safeguards and the potential impact on clinician workload and patient safety.

Abstract

The ability of large language models (LLMs) to follow natural language instructions with human-level fluency suggests many opportunities in healthcare to reduce administrative burden and improve quality of care. However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging. Existing question answering datasets for electronic health record (EHR) data fail to capture the complexity of information needs and documentation burdens experienced by clinicians. To address these challenges, we introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes clinician-written reference responses for 303 instructions, and provides 276 longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality of each LLM response. We found high error rates, ranging from 35% (GPT-4) to 68% (MPT-7B-Instruct), and an 8.3% drop in accuracy moving from 32k to 2k context lengths for GPT-4. Finally, we report correlations between clinician rankings and automated natural language generation metrics as a way to rank LLMs without human review. We make MedAlign available under a research data use agreement to enable LLM evaluations on tasks aligned with clinician needs and preferences.

Paper Structure (48 sections, 12 figures, 21 tables)

Figures (12)

  • Figure 1: In MedAlign, patient EHRs are transformed into XML markup (example provided in Figure \ref{fig:xml_markup_example}) and paired with clinician-generated instructions using a retrieval-based (BM25) scoring metric. The resulting set of instruction + EHR pairs is then reviewed by clinicians to write gold responses, which are used to evaluate EHR instruction following in large language models.
  • Figure 2: (Left) Head-to-head comparison of model performance based on human ranks. The number in row $i$, column $j$ indicates the proportion of instructions for which the response generated by the model in row $i$ was strictly preferred over the response generated by the model in column $j$. (Right) Head-to-head evaluation of model performance using COMET ranks. The matrix has the same structure and interpretation as the one on the left, but uses rankings derived from COMET, an automated metric, rather than clinician-generated rankings. Model win rates under COMET follow a pattern similar to model win rates under human rankings.
  • Figure 3: Automated evaluation of medical instruction-tuned LLMs vs. general instruction-tuned counterparts using the best-performing metrics (COMET and BERTScore).
  • Figure S1: MedAlign cohort diagram: selection criteria for the construction of relevant instruction-EHR pairs assessed by clinicians.
  • Figure S2: Treemap of the clinical instruction categories (taxonomy) assigned by a clinician. Each category within the treemap is associated with a parent class derived from the clinician-generated taxonomy.
  • ...and 7 more figures
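
The head-to-head matrices described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `win_rate_matrix`, the model names, and the ranks below are hypothetical, and it assumes each instruction comes with one rank per model (1 = best, ties allowed). Entry `win[a][b]` is the proportion of instructions on which model `a` was ranked strictly better than model `b`, so ties count toward neither model.

```python
from itertools import permutations


def win_rate_matrix(ranks_per_instruction, models):
    """Return win[a][b]: fraction of instructions where a's rank is strictly lower (better) than b's.

    ranks_per_instruction: list of dicts mapping model name -> rank (1 = best).
    models: list of model names present in every dict.
    """
    n = len(ranks_per_instruction)
    if n == 0:
        raise ValueError("need at least one ranked instruction")

    # Count strict wins for every ordered pair (row model, column model).
    win = {a: {b: 0.0 for b in models if b != a} for a in models}
    for ranks in ranks_per_instruction:
        for a, b in permutations(models, 2):
            if ranks[a] < ranks[b]:
                win[a][b] += 1

    # Convert counts to proportions over all instructions.
    for a in win:
        for b in win[a]:
            win[a][b] /= n
    return win


# Hypothetical ranks for three placeholder models on four instructions.
models = ["model_a", "model_b", "model_c"]
ranks = [
    {"model_a": 1, "model_b": 2, "model_c": 3},
    {"model_a": 1, "model_b": 1, "model_c": 2},  # tie between model_a and model_b
    {"model_a": 2, "model_b": 1, "model_c": 3},
    {"model_a": 3, "model_b": 2, "model_c": 1},
]
win = win_rate_matrix(ranks, models)
print(win["model_a"]["model_b"], win["model_b"]["model_a"])
```

Because preference is strict, `win[a][b] + win[b][a]` can be less than 1 when two models tie on some instructions. The same function applies unchanged to ranks derived from an automated metric such as COMET, which is how the left and right panels can share one matrix layout.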