EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie

TL;DR

This work presents EHR-R1, a reasoning-enhanced large language model tailored for electronic health record analysis, built on the large-scale EHR-Ins instruction dataset, which is generated with a thinking-graph reasoning synthesis pipeline. The authors propose a three-stage training curriculum comprising domain adaptation, reasoning enhancement, and reinforcement learning with Group Relative Policy Optimization to give EHR-R1 robust domain knowledge and longitudinal clinical reasoning. They introduce EHR-Bench, a comprehensive MIMIC-IV-based benchmark spanning 42 tasks that tests reasoning and prediction, and demonstrate that EHR-R1-72B achieves state-of-the-art performance across decision making and risk prediction, with strong zero-shot generalization to the EHRSHOT and MIMIC-IV-CDM datasets. The results show significant improvements over leading LLMs and emphasize the value of explicit reasoning pathways grounded in clinical knowledge, offering a scalable path toward more reliable and clinically relevant EHR analysis.

Abstract

Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and a lack of EHR-oriented reasoning capabilities. This paper aims to bridge that gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench significantly advance the development of more reliable and clinically relevant EHR analysis.
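
The TL;DR identifies the reinforcement learning stage as Group Relative Policy Optimization (GRPO). As a rough illustration of the idea behind GRPO, and not the authors' implementation, the sketch below computes group-relative advantages: several answers are sampled for one prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The reward values, the function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    rewards: scores for several answers sampled for the SAME prompt.
    eps: small constant (an assumption here) that avoids division by
         zero when every sample receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one EHR question:
# 1.0 = correct prediction, 0.0 = incorrect prediction.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where all answers score the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Answers that score above their group average receive a positive advantage and the rest a negative one, so no separately trained value model is needed to provide a baseline.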

Paper Structure

This paper contains 68 sections, 16 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Overview of the EHR tasks and the proposed method. a. EHR Analysis Tasks. EHR analysis tasks fall into two types: decision-making and risk-prediction. b. Methods Overview. Our approach addresses these challenges with a three-stage training pipeline. First, a large volume of non-reasoning data is used for continual pre-training. This is followed by an instruction-tuning phase that leverages reasoning data. Finally, reinforcement learning with Group Relative Policy Optimization (GRPO) is applied to further refine the model. c. Results. This panel compares the performance of our model against several baseline LLMs on both decision-making and risk-prediction tasks, showcasing its superior performance.
  • Figure 1: A case study of EHR Trajectory, Medical Relation, and Reasoning Chain. (a) EHR Trajectory for a patient, where <events>... and <item info>... represent the omission of a large amount of information for display purposes. (b) Medical Relation, showing the connections between the context medical entities and target items. (c) Reasoning Chain, detailing the process of inferring a diagnosis from the patient's EHR data. The parts highlighted in bold are the content commonly found in the EHR Trajectory, Medical Relation, and Reasoning Chain. This indicates that the medical graph is effective in identifying valid medical entities from the trajectory and using them to enhance reasoning.
  • Figure 2: Overview of reasoning data in EHR-Ins. a. The sample size of each task in the reasoning data of EHR-Ins. b. Example of human evaluation on the reasoning data. c. Manual evaluation results on EHR reasoning data across eight decision-making tasks, each associated with a distinct type of decision-making event. We compared the quality of synthetic reasoning data with and without thinking-graph enhancement, where '***' represents a significance level of $p<0.001$.
  • Figure 3: Overview of EHR-Ins and EHR-Bench. The hierarchical ring chart displays the distribution of both datasets. The inner ring partitions tasks into two types: risk prediction and decision making. The middle ring shows 12 task categories (subtypes). The outer ring details all 42 specific tasks.
  • Figure 4: Performance comparison of EHR-R1 and nine baseline LLMs across 24 decision-making tasks on EHR-Bench. The performance is measured with F1 score. Cross-hatched bars denote reasoning-enhanced models, highlighting the effect of explicit reasoning. In each subplot, our EHR-R1 (rightmost bar) achieves a clear performance advantage on nearly all tasks.
  • ...and 5 more figures
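
Figure 4 reports F1 for the decision-making tasks. The exact scoring procedure is not described in this summary, so the following is only a minimal sketch of one common way to score a predicted set of items against a reference set. The function name, the example labels, and the convention of returning 1.0 when both sets are empty are assumptions for illustration.

```python
def set_f1(predicted, reference):
    """F1 between a predicted set of items and a reference set."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # assumed convention: nothing expected, nothing predicted
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 3 predicted diagnoses, 4 reference diagnoses,
# 2 of them in common, so precision = 2/3 and recall = 1/2.
predicted = ["sepsis", "pneumonia", "acute kidney injury"]
reference = ["sepsis", "acute kidney injury", "heart failure", "anemia"]
score = set_f1(predicted, reference)
```

Here the harmonic mean of precision 2/3 and recall 1/2 gives an F1 of 4/7.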