Temporal Cross-Attention for Dynamic Embedding and Tokenization of Multimodal Electronic Health Records
Yingbo Ma, Suraj Kolla, Dhruv Kaliraman, Victoria Nolan, Zhenhong Hu, Ziyuan Guan, Yuanfang Ren, Brooke Armfield, Tezcan Ozrazgat-Baslanti, Tyler J. Loftus, Parisa Rashidi, Azra Bihorac, Benjamin Shickel
TL;DR
This work tackles the prediction of postoperative complications from multimodal EHR data by introducing a dynamic embedding and tokenization framework that captures temporal structure in heterogeneous time series. It combines flexible, time-aware encodings, including Time2Vec-based learnable time representations and variable-specific encoders, with a relative positional mechanism that preserves local dependencies. Multimodal fusion is achieved through cross-attention between structured time series and unstructured clinical notes encoded by a pretrained Longformer, yielding a joint representation of both modalities. On real-world data from three hospitals, the approach outperforms strong baselines in a multitask setup across nine complications, and cross-modal fusion provides further gains. These results point to practical potential for improved patient trajectory modeling and outcome prediction.
Abstract
The breadth, scale, and temporal granularity of modern electronic health record (EHR) systems offer great potential for estimating personalized and contextual patient health trajectories using sequential deep learning. However, learning useful representations of EHR data is challenging due to its high dimensionality, sparsity, multimodality, irregular and variable-specific recording frequency, and timestamp duplication when multiple measurements are recorded simultaneously. Although recent efforts to fuse structured EHR data and unstructured clinical notes suggest the potential for more accurate prediction of clinical outcomes, less focus has been placed on EHR embedding approaches that directly address temporal EHR challenges by learning time-aware representations from multimodal patient time series. In this paper, we introduce a dynamic embedding and tokenization framework for precise representation of multimodal clinical time series that combines novel methods for encoding time and sequential position with temporal cross-attention. Our embedding and tokenization framework, when integrated into a multitask transformer classifier with sliding window attention, outperformed baseline approaches on the exemplar task of predicting the occurrence of nine postoperative complications following more than 120,000 major inpatient surgeries, using multimodal data from three hospitals and two academic health centers in the United States.
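
As a rough illustration of the Time2Vec-style time representation named in the TL;DR, the sketch below computes the standard Time2Vec form for a scalar timestamp: one linear component followed by sinusoidal components. It is not the framework's implementation. The function name `time2vec`, the fixed `omega` and `phi` values (which would be learned parameters in a trained model), and the example timestamps are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def time2vec(tau: float, omega: Sequence[float], phi: Sequence[float]) -> List[float]:
    """Time2Vec-style encoding of a scalar time value (illustrative sketch).

    Component 0 is linear in tau; the remaining components are sinusoidal.
    In a trained model omega and phi would be learned; here they are plain
    numbers supplied by the caller.
    """
    if len(omega) != len(phi) or len(omega) == 0:
        raise ValueError("omega and phi must be non-empty and the same length")
    linear = omega[0] * tau + phi[0]
    periodic = [math.sin(w * tau + p) for w, p in zip(omega[1:], phi[1:])]
    return [linear] + periodic


# Hypothetical irregular timestamps (hours since admission); two coincide,
# mimicking measurements recorded simultaneously.
timestamps = [0.0, 0.5, 0.5, 3.0]
omega = [1.0, 2.0, 0.5]  # assumed frequencies, not values from the paper
phi = [0.0, 0.0, 1.0]    # assumed phase shifts, not values from the paper
embeddings = [time2vec(t, omega, phi) for t in timestamps]
```

Because the encoding depends only on the timestamp, the two measurements recorded at 0.5 hours receive identical time vectors. A time encoding alone therefore cannot separate duplicated timestamps, which suggests why the framework pairs it with variable-specific encoders and a positional mechanism.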
