Cross-Representation Benchmarking in Time-Series Electronic Health Records for Clinical Outcome Prediction

Tianyi Chen, Mingcheng Zhu, Zhiyao Luo, Tingting Zhu

TL;DR

The paper tackles the lack of fair, cross-representation evaluation for EHR-based clinical prediction by benchmarking three data representations (multivariate time-series, event streams, and textual event streams) across ICU and longitudinal tasks. It introduces a reproducible pipeline to curate data, instantiate representations, and evaluate models under consistent splits, enabling like-for-like comparisons. Key findings: event-stream models consistently outperform the others; pretrained CLMBR is highly sample-efficient in few-shot settings; and pruning features by missingness improves ICU predictions, whereas retaining sparse features benefits longitudinal tasks. The results provide practical guidance for selecting EHR representations based on clinical context and data regime, with implications for model deployment and uncertainty analysis.

Abstract

Electronic Health Records (EHRs) enable deep learning for clinical predictions, but the optimal method for representing patient data remains unclear due to inconsistent evaluation practices. We present the first systematic benchmark to compare EHR representation methods, including multivariate time-series, event streams, and textual event streams for LLMs. This benchmark standardises data curation and evaluation across two distinct clinical settings: the MIMIC-IV dataset for ICU tasks (mortality, phenotyping) and the EHRSHOT dataset for longitudinal care (30-day readmission, 1-year pancreatic cancer prediction). For each paradigm, we evaluate appropriate modelling families (Transformers, MLPs, LSTMs, and RETAIN for time-series; CLMBR and count-based models for event streams; and LLMs with 8-20B parameters for textual streams) and analyse the impact of feature pruning based on data missingness. Our experiments reveal that event stream models consistently deliver the strongest performance. Pre-trained models like CLMBR are highly sample-efficient in few-shot settings, though simpler count-based models can be competitive given sufficient data. Furthermore, we find that feature selection strategies must be adapted to the clinical setting: pruning sparse features improves ICU predictions, while retaining them is critical for longitudinal tasks. Our results, enabled by a unified and reproducible pipeline, provide practical guidance for selecting EHR representations based on the clinical context and data regime.
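
To make the three representations concrete, the following is a minimal sketch that derives each of them from the same toy record. It is not the paper's pipeline: the toy events, the hourly binning, the forward-fill imputation, and the function names `to_time_series`, `to_event_stream`, and `to_text` are all assumptions chosen for illustration.

```python
# Hypothetical toy record: (hours since admission, feature name, value).
events = [
    (0.5, "heart_rate", 88.0),
    (0.5, "glucose", 140.0),
    (2.2, "heart_rate", 95.0),
    (3.7, "lactate", 2.1),
]

def to_time_series(events, features, n_bins, bin_hours=1.0):
    """(a) Fixed grid: last value per bin, forward-filled; None if never observed."""
    grid = [[None] * len(features) for _ in range(n_bins)]
    col = {name: j for j, name in enumerate(features)}
    for t, name, value in sorted(events):
        b = int(t // bin_hours)
        if b < n_bins and name in col:
            grid[b][col[name]] = value
    # Forward-fill: carry the previous bin's value into empty cells.
    for b in range(1, n_bins):
        for j in range(len(features)):
            if grid[b][j] is None:
                grid[b][j] = grid[b - 1][j]
    return grid

def to_event_stream(events):
    """(b) Time-ordered (time, code, value) tuples; no grid, no imputation."""
    return sorted(events)

def to_text(events):
    """(c) The event stream serialised as plain-text lines, e.g. for an LLM prompt."""
    return "\n".join(f"[{t:.1f}h] {name}: {value}" for t, name, value in sorted(events))

features = ["heart_rate", "glucose", "lactate"]
series = to_time_series(events, features, n_bins=4)
stream = to_event_stream(events)
text = to_text(events)
```

In this sketch only the time-series form needs a fixed feature list and an imputation rule. The event stream and its textual form keep just the observations that actually occurred, so sparsely recorded features cost nothing until they appear.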

Paper Structure

This paper contains 9 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: Different Representation Methods for EHRs. (a) multivariate time-series, with imputation; (b) event stream; (c) textual event stream, used for LLMs.
  • Figure 2: Average performance of models across ICU mortality, phenotyping, 30-day readmission, and 1-year pancreatic cancer predictions.
  • Figure 3: Comparison of model performance across different representations with different missing rate thresholds.
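
Pruning features by a missing-rate threshold, the setting varied in Figure 3, can be sketched as follows. This is an illustrative sketch only: the toy rows, the threshold values, and the helper names `missing_rate` and `prune_by_missingness` are assumptions, and the thresholds used in the paper are not stated in this summary.

```python
def missing_rate(rows, feature):
    """Fraction of rows in which `feature` is absent or None."""
    missing = sum(1 for row in rows if row.get(feature) is None)
    return missing / len(rows)

def prune_by_missingness(rows, features, max_missing_rate):
    """Keep only features whose missing rate is at most `max_missing_rate`."""
    return [f for f in features if missing_rate(rows, f) <= max_missing_rate]

# Hypothetical toy cohort: one dict per patient stay, None marks a missing value.
rows = [
    {"heart_rate": 88.0, "glucose": 140.0, "lactate": None},
    {"heart_rate": 95.0, "glucose": None, "lactate": None},
    {"heart_rate": 91.0, "glucose": 120.0, "lactate": 2.1},
    {"heart_rate": 79.0, "glucose": None, "lactate": None},
]
features = ["heart_rate", "glucose", "lactate"]

kept_strict = prune_by_missingness(rows, features, 0.25)  # aggressive pruning
kept_loose = prune_by_missingness(rows, features, 0.5)
kept_all = prune_by_missingness(rows, features, 1.0)      # retain sparse features
```

A low threshold corresponds to the aggressive pruning that the abstract reports as helpful for ICU tasks, while a threshold of 1.0 keeps every sparse feature, which the abstract reports as important for longitudinal tasks.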