TIMER: Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records

Hejie Cui, Alyssa Unell, Bowen Chen, Jason Alan Fries, Emily Alsentzer, Sanmi Koyejo, Nigam Shah

TL;DR

This work addresses the challenge that large language models struggle to reason over longitudinal EHRs. It introduces TIMER, a framework consisting of TIMER-Bench for time-aware evaluation and TIMER-Instruct for temporal instruction tuning, both grounded in explicit time evidence across patient timelines. The authors show that temporally aware tuning improves performance by about 7.3% on physician-generated benchmarks and 9.2% on TIMER-Bench, illustrating the importance of the temporal distribution of training data. The approach offers a practical pathway to enhance longitudinal clinical reasoning and could be extended to other domains requiring multi-timepoint understanding of events.

Abstract

Large language models (LLMs) have emerged as promising tools for assisting in medical tasks, yet processing Electronic Health Records (EHRs) presents unique challenges due to their longitudinal nature. While LLMs' capabilities to perform medical tasks continue to improve, their ability to reason over temporal dependencies across multiple patient visits and time frames remains unexplored. We introduce TIMER (Temporal Instruction Modeling and Evaluation for Longitudinal Clinical Records), a framework that incorporates instruction-response pairs grounded in different parts of a patient's record as a critical dimension in both instruction evaluation and tuning for longitudinal clinical records. We develop TIMER-Bench, the first time-aware benchmark that evaluates temporal reasoning capabilities over longitudinal EHRs, as well as TIMER-Instruct, an instruction-tuning methodology for LLMs to learn reasoning over time. We demonstrate that models fine-tuned with TIMER-Instruct improve performance by 7.3% on human-generated benchmarks and 9.2% on TIMER-Bench, indicating that temporal instruction tuning improves model performance for reasoning over EHRs.

Paper Structure

This paper contains 26 sections, 1 equation, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Overview of temporal instruction modeling challenges and our TIMER framework. Top: Existing benchmarks suffer from limited temporal coverage and recent-context bias, while baseline models show critical failures in longitudinal reasoning. Bottom: The TIMER framework addresses these limitations through temporal-aware distribution evaluation (TIMER-Bench) and instruction tuning (TIMER-Instruct), achieving significant improvements in both human-curated and model-generated benchmarks.
  • Figure 2: MedAlign instruction benchmark for longitudinal records emphasizes recent portions of each patient's longitudinal record.
  • Figure 3: Overview of the TIMER framework. Left: TIMER-Bench creates evaluation sets with explicit temporal evidence, covering questions across different time periods in patient histories to assess longitudinal EHR reasoning. Right: TIMER-Instruct enhances model performance through instruction tuning with LLM-generated instruction-response pairs that are distributed diversely across EHR timelines.
  • Figure 4: Temporal distribution in model-generated instruction-response pairs reveals a "lost-in-the-middle" phenomenon. Using our normalized temporal position metric (x-axis: 0% to 100% of timeline), we find that instructions strongly favor timeline edges. The density plot shows high concentration at recent and early periods, while middle periods receive significantly less attention.
  • Figure 5: We evaluate on three benchmarks with varying temporal distributions: recent-focused, edge-focused, and uniform.
  • ...and 4 more figures
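
The Figure 4 caption refers to a normalized temporal position metric that places each instruction's evidence on a 0% to 100% axis of the patient timeline. The page does not reproduce the paper's definition, so the following is only a minimal sketch under one plausible assumption: position is the linear fraction of elapsed time between the first and last events in a record. The function names and the three-way early/middle/recent split are hypothetical and used for illustration only.

```python
from datetime import datetime


def normalized_position(event_time, first_time, last_time):
    """Return where event_time falls on the timeline, as a percentage in [0, 100].

    Assumption (not taken from the paper): position is the linear fraction of
    elapsed time between the first and last events of the record.
    """
    span = (last_time - first_time).total_seconds()
    if span <= 0:
        # Degenerate timeline with a single timestamp: treat it as the most recent point.
        return 100.0
    frac = (event_time - first_time).total_seconds() / span
    # Clamp so that events outside the observed window map to the nearest edge.
    return 100.0 * min(max(frac, 0.0), 1.0)


def bucket(position_pct):
    """Hypothetical split of the timeline into equal thirds."""
    if position_pct < 100 / 3:
        return "early"
    if position_pct < 200 / 3:
        return "middle"
    return "recent"


# Example: a record spanning ten days, with evidence dated on the fifth day.
first = datetime(2020, 1, 1)
last = datetime(2020, 1, 11)
evidence = datetime(2020, 1, 6)
position = normalized_position(evidence, first, last)
print(position, bucket(position))
```

Under this assumption, counting how many instruction-response pairs land in each bucket gives a rough view of whether a dataset leans toward the timeline edges, which is the pattern the caption describes.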