KEEP: Integrating Medical Ontologies with Clinical Data for Robust Code Embeddings

Ahmed Elhussein, Paul Meddeb, Abigail Newbury, Jeanne Mirone, Martin Stoll, Gamze Gursoy

TL;DR

The paper addresses the challenge of learning robust representations for structured medical codes by leveraging both ontological knowledge graphs and real-world co-occurrence patterns. It introduces KEEP, a two-stage embedding framework that first derives knowledge-graph–preserving embeddings via node2vec and then refines them through regularized training with patient-history co-occurrences, optimizing $J(W) = L_{\text{GloVe}} + L_{\text{reg}}$ with $L_{\text{reg}} = \lambda \sum_i || w_i - w_i^{\text{n2v}} ||^2$. The method uses OMOP-based disease graphs with depth up to five and evaluates on UK Biobank and MIMIC-IV, showing superior semantic relational encoding and downstream clinical prediction under resource constraints. KEEP achieves better intrinsic and extrinsic performance than traditional embeddings and pre-trained LMs, while requiring substantially less computation, highlighting its practical value for diverse healthcare settings. The approach can complement LM-based systems via ontology-informed initialization or multimodal fusion, and future work will extend the graph with additional relation types and temporal dynamics.
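
The objective above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the regularizer follows the formula in the TL;DR, while the GloVe term uses the standard weighted least-squares form of GloVe (the weighting function, `x_max`, `alpha`, and the separate context vectors and biases are assumptions, as are all function and parameter names).

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Standard GloVe weighting f(x) = min(1, (x / x_max)^alpha) (assumed here)."""
    return np.minimum(1.0, (x / x_max) ** alpha)

def keep_style_objective(W, W_ctx, b, b_ctx, X, W_n2v, lam=1e-3):
    """Sketch of J(W) = L_GloVe + L_reg.

    W      : (n, d) code embeddings being trained
    W_ctx  : (n, d) context embeddings (standard GloVe, assumed)
    b, b_ctx : (n,) bias terms (standard GloVe, assumed)
    X      : (n, n) co-occurrence counts from patient histories
    W_n2v  : (n, d) node2vec embeddings from the knowledge graph
    lam    : regularization strength (lambda in the TL;DR formula)
    """
    mask = X > 0                                  # only observed pairs contribute
    log_x = np.log(np.where(mask, X, 1.0))        # log(1) = 0 where unobserved
    pred = W @ W_ctx.T + b[:, None] + b_ctx[None, :]
    l_glove = np.sum(glove_weight(X) * mask * (pred - log_x) ** 2)
    # L_reg = lambda * sum_i ||w_i - w_i^{n2v}||^2 keeps W near the ontology embeddings
    l_reg = lam * np.sum((W - W_n2v) ** 2)
    return l_glove + l_reg
```

A larger `lam` keeps the embeddings closer to their node2vec starting point, while a smaller value lets the co-occurrence data move them further.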

Abstract

Machine learning in healthcare requires effective representation of structured medical codes, but current methods face a trade-off: knowledge-graph-based approaches capture formal relationships but miss real-world patterns, while data-driven methods learn empirical associations but often overlook structured knowledge in medical terminologies. We present KEEP (Knowledge-preserving and Empirically refined Embedding Process), an efficient framework that bridges this gap by combining knowledge graph embeddings with adaptive learning from clinical data. KEEP first generates embeddings from knowledge graphs, then employs regularized training on patient records to adaptively integrate empirical patterns while preserving ontological relationships. Importantly, KEEP produces final embeddings without task-specific auxiliary or end-to-end training, enabling it to support multiple downstream applications and model architectures. Evaluations on structured EHR data from UK Biobank and MIMIC-IV demonstrate that KEEP outperforms both traditional and language-model-based approaches in capturing semantic relationships and predicting clinical outcomes. Moreover, KEEP's minimal computational requirements make it particularly suitable for resource-constrained environments.

Paper Structure

This paper contains 45 sections, 10 equations, 1 figure, 13 tables, 1 algorithm.

Figures (1)

  • Figure 1: Overview of KEEP's approach: (A) Random walks are generated on the knowledge graph. (B) The walks are used to create initial embeddings whose dimensions align with the ontology. (C) A co-occurrence matrix is constructed from EHR data. (D) A GloVe model is initialized with the embeddings from (B) and regularized to incorporate empirical relationships from (C) while preserving ontologically aligned dimensions. That is, embeddings are adjusted based on the strength of observed associations.
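
Step (C) of the figure can be illustrated with a small sketch. The counting scheme is an assumption, since the text here does not specify it: each unordered pair of distinct codes is counted once per patient whose history contains both. The function name, the `vocab` argument, and the placeholder codes in the usage example are hypothetical.

```python
from itertools import combinations
import numpy as np

def cooccurrence_matrix(patient_histories, vocab):
    """Build a symmetric code-by-code co-occurrence matrix.

    patient_histories : iterable of code lists, one list per patient
    vocab             : ordered list of codes defining the matrix rows/columns
    Assumption: a pair counts once per patient, regardless of repeats or order.
    """
    index = {code: i for i, code in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for history in patient_histories:
        # Deduplicate within a patient and ignore codes outside the vocabulary
        codes = sorted({index[c] for c in history if c in index})
        for i, j in combinations(codes, 2):
            X[i, j] += 1
            X[j, i] += 1
    return X

# Usage with placeholder codes (not real medical codes)
histories = [["A", "B"], ["A", "B", "C"], ["C"]]
X = cooccurrence_matrix(histories, ["A", "B", "C"])
```

A matrix built this way could then serve as the co-occurrence input to the regularized GloVe stage in step (D).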