Patient Trajectory Prediction: Integrating Clinical Notes with Transformers
Sifal Klioui, Sana Sellami, Youssef Trardi
TL;DR
This work predicts patient disease trajectories from EHRs by combining structured diagnostic codes with unstructured clinical notes in a transformer model. It introduces Clinical Mosaic, a note-aware transformer pretrained on MIMIC-IV-NOTES 2.2 with 512-token sequences and ALiBi, and fuses its note embeddings with CCS code embeddings in an encoder-decoder architecture. On MIMIC-IV datasets, the note-augmented model outperforms structured-data baselines on the ranking metrics $MAP@K$ and $MAR@K$ for $K \in \{20, 40, 60\}$, with particularly strong gains on $MAR@K$. The results suggest that incorporating clinical notes preserves long-range dependencies and contextual reasoning, enabling more accurate trajectory predictions. Future work includes continual learning and end-to-end automated pipelines.
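To make the ranking metrics above concrete, the following is a minimal sketch of precision and recall at a cutoff $K$, averaged over visits. It assumes a simplified reading of $MAP@K$ and $MAR@K$ as mean precision@K and mean recall@K; the paper's exact definitions may differ, for example by weighting hits by rank. The function names and the toy code labels are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set, Tuple


def precision_recall_at_k(
    ranked: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Precision@K and recall@K for a single visit.

    ranked:   predicted codes, most likely first
    relevant: codes actually recorded at the next visit
    """
    top_k = ranked[:k]
    hits = sum(1 for code in top_k if code in relevant)
    precision = hits / k
    # Guard against visits with no recorded codes.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def mean_precision_recall_at_k(
    all_ranked: List[Sequence[str]], all_relevant: List[Set[str]], k: int
) -> Tuple[float, float]:
    """Average precision@K and recall@K over all visits (assumed simplification)."""
    pairs = [
        precision_recall_at_k(ranked, relevant, k)
        for ranked, relevant in zip(all_ranked, all_relevant)
    ]
    n = len(pairs)
    mean_precision = sum(p for p, _ in pairs) / n
    mean_recall = sum(r for _, r in pairs) / n
    return mean_precision, mean_recall


# Toy example with made-up code labels (not real CCS codes) and a small K.
predictions = [["A", "B", "C", "D"], ["X", "Y"]]
targets = [{"A", "D", "E"}, {"Y"}]
mp, mr = mean_precision_recall_at_k(predictions, targets, k=2)
print(f"{mp:.3f} {mr:.3f}")  # prints "0.500 0.667"
```

Under this reading, recall@K rewards recovering more of the codes that actually appear at the next visit, so it can keep improving as $K$ grows from 20 to 60, whereas precision@K is divided by the fixed cutoff.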
Abstract
Predicting disease trajectories from electronic health records (EHRs) is a complex task because of data non-stationarity, the high granularity of medical codes, and the difficulty of integrating multimodal data. EHRs contain both structured data, such as diagnostic codes, and unstructured data, such as clinical notes, which hold essential information that is often overlooked. Current models, which rely primarily on structured data, struggle to capture the complete medical context of patients, resulting in a loss of valuable information. To address this issue, we propose an approach that integrates unstructured clinical notes into transformer-based deep learning models for sequential disease prediction. This integration enriches the representation of patients' medical histories, thereby improving the accuracy of diagnosis predictions. Experiments on MIMIC-IV datasets demonstrate that the proposed approach outperforms traditional models relying solely on structured data.
