Analysis of heart failure patient trajectories using sequence modeling

Falk Dippel, Yinan Yu, Annika Rosengren, Martin Lindgren, Christina E. Lundberg, Erik Aerts, Martin Adiels, Helen Sjöland

TL;DR

This paper addresses the lack of systematic performance and efficiency analyses in EHR-based sequence modeling for heart failure through an ablation study across three architecture classes (Transformers, Transformers++, Mambas) and across input and temporal design choices. It analyzes token granularity, context length, model size, and history preprocessing in a large Swedish HF cohort (N = 42,820) on three one-year prediction tasks, finding that Llama generally achieves the best discrimination and calibration, with strong data efficiency. The study provides concrete design recommendations, notably favoring $C=512$ with compact vocabularies and aggregated histories, and shows that strong performance is achievable with less training data and selective concept sets. Together, these findings offer practical guidance for developing clinically applicable, resource-efficient sequence models for EHR data and motivate extensions to multi-modal and external-validation studies.

Abstract

Transformers have defined the state of the art for clinical prediction tasks involving electronic health records (EHRs). The recently introduced Mamba architecture outperformed an advanced Transformer (Transformer++) based on Llama in handling long context lengths, while using fewer model parameters. Despite the impressive performance of these architectures, a systematic approach to empirically analyzing model performance and efficiency under various settings is not well established in the medical domain. The performance of six sequence models was investigated across three architecture classes (Transformers, Transformers++, Mambas) in a large Swedish heart failure (HF) cohort (N = 42,820), providing a clinically relevant case study. Patient data included diagnoses, vital signs, laboratories, medications and procedures extracted from in-hospital EHRs. The models were evaluated on three one-year prediction tasks: clinical instability (a readmission phenotype) after initial HF hospitalization, mortality after initial HF hospitalization and mortality after latest hospitalization. Ablations account for modifications of the EHR-based input patient sequence, architectural model configurations, and temporal preprocessing techniques for data collection. Llama achieves the highest predictive discrimination and the best calibration, and shows robustness across all tasks, followed by Mambas. Both architectures demonstrate efficient representation learning, with tiny configurations surpassing other large-scale Transformers. At equal model size, Llama and Mambas achieve superior performance using 25% less training data. This paper presents a first ablation study with systematic design choices for input tokenization, model configuration and temporal data preprocessing. Future model development in clinical prediction tasks using EHRs could build upon this study's recommendations as a starting point.

Paper Structure

This paper contains 33 sections, 7 equations, 14 figures, 24 tables.

Figures (14)

  • Figure 1: Overview of the ablation study following a simplified development process, from source through model training to clinical prediction: Research question RQ1 explores variations of the tokenized input sequence, RQ2 investigates the performance across different architectural configurations, RQ3 analyzes temporal modifications of the data and RQ4 investigates the model utility for varying data availability. Each ablation is conducted on three clinical tasks for all model types, with varying ablation parameters highlighted in blue. DX=Diagnoses. LAB=Laboratories. MED=Medications. PRO=Procedures. VIT=Vital signs.
  • Figure 2: (a) Overview of the clinical prediction tasks in this study: Given a simplified chronologically sorted patient sequence, separate sequence models were trained to predict at discharge: one-year clinical instability at the initial HF diagnosis in-hospital (trajectory 1), one-year mortality at the initial HF diagnosis in-hospital, and one-year mortality at the time of the latest hospitalization (trajectory 2). (b) Corresponding input sequence representation after the initial HF diagnosis trajectory highlighting the aggregated embeddings of tokenized concepts, concept token types, demographics (age, sex, body mass index (BMI)), time of events, visit numbers, alternating visit segments and absolute token positions.
  • Figure 3: Ablation of the vocabulary $V$ across three clinical tasks $T$ evaluated by bootstrapped AUPRC ($\uparrow$) for $\mathrm{Medium}$-sized sequence models and $C=512$. The resolution of the vocabulary increases from lowest (left) to highest (right). Gray background highlights common setup shared across all ablations.
  • Figure 4: Ablation of the context length $C$ and model size evaluated by bootstrapped AUPRC ($\uparrow$) across three different clinical tasks. Within each $C$, sequence models are sorted by $\mathrm{\underline{T}iny}$, $\mathrm{\underline{S}mall}$, and $\mathrm{\underline{M}edium}$ configuration, and compared to the $\mathrm{\underline{D}efault}$ configuration. Gray background highlights common setup shared across all ablations.
  • Figure 5: Ablation of the patient history $H$: The historical record is extended from the latest available hospitalization (left) to a prolonged context (right). All modifications are evaluated by bootstrapped AUPRC ($\uparrow$) using $\mathrm{Medium}$-sized sequence models and $C=512$. Gray background highlights common setup shared across all ablations.
  • ...and 9 more figures
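
The figure captions above report bootstrapped AUPRC as the evaluation metric. As an illustration only, the following minimal sketch shows one common way such an estimate could be computed: resample the test set with replacement, compute average precision on each resample, and summarize with a mean and percentile interval. The function names, the number of resamples, the 95% interval, and the use of average precision as the AUPRC estimator are all assumptions made for this example; the paper's actual bootstrap procedure is not described in this section.

```python
import numpy as np


def average_precision(y_true, scores):
    """Average precision: mean of precision@k over the ranks k of the positives.

    Simplification (assumption): tied scores are ranked in input order rather
    than being grouped into a single threshold.
    """
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = y_true[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(np.sum(precision_at_k * hits) / np.sum(hits))


def bootstrap_auprc(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Return (mean, lower, upper) of AUPRC over bootstrap resamples."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    rng = np.random.default_rng(seed)
    n = len(y_true)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample patients with replacement
        if y_true[idx].sum() == 0:
            continue  # AUPRC is undefined without at least one positive
        estimates.append(average_precision(y_true[idx], scores[idx]))
    estimates = np.array(estimates)
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(estimates.mean()), float(lower), float(upper)
```

Given arrays of binary outcomes and predicted risks for a held-out set, `bootstrap_auprc(y, p)` returns a point estimate with a percentile interval. With rare outcomes such as one-year mortality, a stratified resampling scheme may be preferable to the plain resampling shown here.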