Predicting clinical outcomes from patient care pathways represented with temporal knowledge graphs
Jong Ho Jhee, Alberto Megina, Pacôme Constant Dit Beaufils, Matilde Karakachoff, Richard Redon, Alban Gaignard, Adrien Coulet
TL;DR
The paper investigates whether knowledge graph representations of patient care pathways can improve clinical outcome prediction for ruptured intracranial aneurysms. It compares tabular baselines with graph-based embeddings (TransE, RDF2Vec, RGCN+Lit) across SPHN and CARE-SM schemas and various time modeling choices, finding that RGCN+Lit on SPHN yields the best performance. A publicly released synthetic dataset and transformation scripts enable reproducibility, and results highlight the value of compact patient-centric schemas and literal-aware embeddings while showing time encoding has a nuanced effect. The work points to practical potential for KG-based predictive tools in healthcare, while noting limitations such as class imbalance and the need for clinical validation.
Abstract
Background: With the increasing availability of healthcare data, predictive modeling has found many applications in the biomedical domain, such as assessing the level of risk for various conditions, which in turn can guide clinical decision making. However, it remains unclear whether knowledge graph data representations and their embeddings, which are competitive in some settings, are of benefit to biomedical predictive modeling. Method: We simulated synthetic but realistic data of patients with intracranial aneurysm and experimented on the task of predicting their clinical outcome. We compared the performance of various classification approaches on tabular data versus a graph-based representation of the same data. Next, we investigated how the schema adopted to represent, first, individual data and, second, temporal data impacts predictive performance. Results: Our study illustrates that, in our case, a graph representation combined with Graph Convolutional Network (GCN) embeddings reaches the best performance for a predictive task from observational data. We emphasize the importance of the adopted schema and of taking literal values into account when representing individual data. Our study also indicates that the choice of time encoding has only a moderate impact on GCN performance.
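To make the contrast between a tabular record and a graph-based representation of the same data concrete, here is a minimal sketch in Python. It is an illustration only, not the paper's transformation scripts and not the SPHN or CARE-SM schema. The function name `row_to_triples`, the `patient:` and `has_` prefixes, and the example column names are all hypothetical. The sketch assumes that numeric values are kept as literals while categorical values become entity nodes, which is one simple way to preserve literal values in a graph.

```python
from typing import Dict, List, Optional, Tuple, Union

# A cell value in the tabular record (hypothetical typing for this sketch).
Value = Union[int, float, str]
Triple = Tuple[str, str, Value]


def row_to_triples(patient_id: str, row: Dict[str, Optional[Value]]) -> List[Triple]:
    """Turn one tabular patient record into (subject, predicate, object) triples.

    Assumption for illustration: numeric cells stay literal values,
    every other cell becomes an entity node named "<column>:<value>".
    """
    subject = f"patient:{patient_id}"
    triples: List[Triple] = []
    for column, value in row.items():
        if value is None:
            continue  # a missing cell simply produces no triple
        predicate = f"has_{column}"
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            triples.append((subject, predicate, value))  # literal object
        else:
            triples.append((subject, predicate, f"{column}:{value}"))  # entity object
    return triples


# Hypothetical record; the column names are not taken from the paper's dataset.
example = row_to_triples("p001", {"age": 54, "location": "MCA", "treatment": None})
```

In this sketch a missing cell yields no triple at all, whereas a tabular model would usually need an explicit imputation step. That is one practical difference between the two representations.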
