PaReGTA: An LLM-based EHR Data Encoding Approach to Capture Temporal Information

Kihyuk Yoon; Lingchao Mao; Catherine Chong; Todd J. Schwedt; Chia-Chun Chiang; Jing Li

TL;DR

PaReGTA addresses the loss of temporal information in structured EHR data with an end-to-end LLM-based encoding pipeline: it renders visit-level events as templated text, applies lightweight domain adaptation to the embeddings via SimCSE, and aggregates visits into fixed-dimensional patient representations through a hybrid time-aware pooling scheme. The approach is complemented by PaReGTA-RSS, a perturbation-based representation-shift method that yields patient- and cohort-level factor attributions compatible with the embedding-based pipeline. Evaluated on 39,088 migraine patients from the All of Us dataset, PaReGTA outperforms sparse baselines across prediction tasks and remains stable where deep sequential models struggle with real-world heterogeneity. Overall, the work demonstrates that temporally aware, text-based EHR encoding with lightweight adaptation and interpretable attributions can improve clinical prediction while reducing data and computation requirements.

Abstract

Temporal information in structured electronic health records (EHRs) is often lost in sparse one-hot or count-based representations, while sequence models can be costly and data-hungry. We propose PaReGTA, an LLM-based encoding framework that (i) converts longitudinal EHR events into visit-level templated text with explicit temporal cues, (ii) learns domain-adapted visit embeddings via lightweight contrastive fine-tuning of a sentence-embedding model, and (iii) aggregates visit embeddings into a fixed-dimensional patient representation using hybrid temporal pooling that captures both recency and globally informative visits. Because PaReGTA builds on a pre-trained LLM rather than training from scratch, it can perform well even in data-limited cohorts. Furthermore, PaReGTA is model-agnostic and can benefit from future EHR-specialized sentence-embedding models. For interpretability, we introduce PaReGTA-RSS (Representation Shift Score), which quantifies the importance of clinically defined factors by recomputing representations after targeted factor removal and projecting the resulting representation shifts through a machine learning model. On 39,088 migraine patients from the All of Us Research Program, PaReGTA outperforms sparse baselines for migraine type classification, while deep sequential models were unstable in our cohort.
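To make the three encoding steps and the RSS idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the template wording, the pooling formula, or the RSS projection, so everything below is an assumption chosen for illustration. In particular, the "days before the index date" template, the half-life recency weights, the element-wise max used as the "globally informative" component, the mixing weight `alpha`, and the helpers `toy_encode` and `toy_predict` (stand-ins for the SimCSE-adapted sentence encoder and the downstream classifier) are all hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
# A visit is (days before the index date, events grouped by category).
Visit = Tuple[int, Dict[str, List[str]]]


def visit_to_text(days_before: int, events: Dict[str, Sequence[str]]) -> str:
    """Render one visit as templated text with an explicit temporal cue (assumed wording)."""
    parts = [f"{days_before} days before the index date"]
    for category in sorted(events):
        if events[category]:
            parts.append(f"{category}: {', '.join(events[category])}")
    return "; ".join(parts) + "."


def hybrid_temporal_pool(vecs: Sequence[Vector], days_before: Sequence[int],
                         half_life: float = 180.0, alpha: float = 0.5) -> Vector:
    """Blend a recency-weighted mean with an element-wise max over all visits.

    Assumption: recency weights halve every `half_life` days, and the max
    stands in for the 'globally informative visits' component.
    """
    if not vecs or len(vecs) != len(days_before):
        raise ValueError("need one non-empty embedding list and matching day offsets")
    dim = len(vecs[0])
    weights = [0.5 ** (d / half_life) for d in days_before]
    total = sum(weights)
    recency = [sum(w * v[k] for w, v in zip(weights, vecs)) / total for k in range(dim)]
    global_max = [max(v[k] for v in vecs) for k in range(dim)]
    return [alpha * r + (1.0 - alpha) * g for r, g in zip(recency, global_max)]


def encode_patient(visits: Sequence[Visit], encode: Callable[[str], Vector]) -> Vector:
    """Textualize each visit, embed it, and pool into one fixed-size vector."""
    vecs = [encode(visit_to_text(days, events)) for days, events in visits]
    return hybrid_temporal_pool(vecs, [days for days, _ in visits])


def representation_shift_score(visits: Sequence[Visit], factor: str,
                               encode: Callable[[str], Vector],
                               predict: Callable[[Vector], float]) -> float:
    """Model output on the full record minus output after removing `factor`.

    Assumption: the shift is projected through the model as a simple
    difference of predicted scores; visits left empty are kept, not dropped.
    """
    ablated = [(days, {c: [x for x in items if x != factor] for c, items in events.items()})
               for days, events in visits]
    return predict(encode_patient(visits, encode)) - predict(encode_patient(ablated, encode))


# Hypothetical stand-ins so the sketch runs without any model download.
TOY_VOCAB = ("sumatriptan", "topiramate")


def toy_encode(text: str) -> Vector:
    """Toy 'embedding': counts of two drug names in the visit text."""
    return [float(text.count(word)) for word in TOY_VOCAB]


def toy_predict(z: Vector) -> float:
    """Toy classifier: logistic score on the difference of the two dimensions."""
    return 1.0 / (1.0 + math.exp(-(z[0] - z[1])))
```

With a two-visit toy record such as `[(30, {"drug": ["sumatriptan"]}), (400, {"drug": ["topiramate"]})]`, calling `representation_shift_score(record, "sumatriptan", toy_encode, toy_predict)` returns a positive value, because removing the factor lowers the toy model's score. In a real setting `toy_encode` would be replaced by the adapted sentence-embedding model and `toy_predict` by the trained classifier.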

Paper Structure (26 sections, 11 equations, 5 figures, 10 tables)

Introduction
Methods
Dataset and cohort construction
PaReGTA: proposed encoding method for EHR dataset
Framework overview
Visit-level textualization of EHR dataset
Domain adaptation via SimCSE
Encoding visit-level representations
Hybrid temporal pooling into patient representations
Factor importance method by simulation for EHR data
Limitations of conventional feature-importance methods
PaReGTA-RSS (Representation Shift Score)
Results
Dataset and task description
Data encoding and model training
...and 11 more sections

Figures (5)

Figure 1: Overview of PaReGTA.
Figure 2: Factor importance of medications and comorbidities for all patients.
Figure 3: Factor importance of medications and comorbidities for male patients.
Figure 4: Factor importance of medications and comorbidities for female patients. (C) is for patients without temporomandibular disorders or fibromyalgia.
Figure 5: Factor importance of the last N EHR records.
