Integrating Heterogeneous Gene Expression Data through Knowledge Graphs for Improving Diabetes Prediction

Rita T. Sousa; Heiko Paulheim

TL;DR

The paper tackles diabetes prediction from gene expression data, addressing small sample sizes and cross-study heterogeneity with a knowledge graph (KG) framework. It constructs a biomedical KG that fuses multiple expression datasets with domain knowledge (GO, GOA, STRING) and learns patient representations via RDF2Vec embeddings, which are then used for binary diabetes classification. Empirical results show that KG-based integration improves predictive performance across multiple metrics, with the largest gains coming from representing patients through weighted gene embeddings and leveraging domain knowledge; naïve data merging can introduce noise. The approach is extensible to other diseases, and the authors point to graph neural networks that operate directly on the KG as future work.

Abstract

Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of diverse data types such as gene expression data. While gene expression data can provide valuable insights, challenges arise because sample sizes in expression datasets are usually limited, and data from different datasets that measure different genes cannot be easily combined. This work proposes a novel approach to address these challenges by integrating multiple gene expression datasets and domain-specific knowledge using knowledge graphs (KGs), a unique tool for biomedical data integration. KG embedding methods are then employed to generate vector representations, which serve as inputs for a classifier. Experiments demonstrated the efficacy of our approach, revealing improvements in diabetes prediction when integrating multiple gene expression datasets and domain-specific knowledge about protein functions and interactions.
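
The TL;DR mentions representing patients through weighted gene embeddings. As a rough illustration only, the sketch below builds a patient vector as an expression-weighted average of gene vectors. The weighting scheme, the function name `patient_embedding`, the gene identifiers, and the toy two-dimensional vectors are all assumptions made for this example; the paper's actual aggregation and its RDF2Vec embeddings may differ. Expression values are assumed to be non-negative.

```python
def patient_embedding(expression, gene_vectors):
    """Expression-weighted average of gene vectors for one patient.

    expression:   dict mapping gene id -> non-negative expression value
    gene_vectors: dict mapping gene id -> embedding (list of floats)
    Both structures are hypothetical stand-ins for real data.
    """
    # Only genes that are both measured and embedded can contribute.
    shared = [g for g in expression if g in gene_vectors]
    if not shared:
        raise ValueError("no overlap between measured and embedded genes")

    dim = len(gene_vectors[shared[0]])
    total = sum(expression[g] for g in shared)

    if total == 0:
        # All weights are zero: fall back to an unweighted mean.
        return [
            sum(gene_vectors[g][i] for g in shared) / len(shared)
            for i in range(dim)
        ]

    return [
        sum(expression[g] * gene_vectors[g][i] for g in shared) / total
        for i in range(dim)
    ]


# Toy gene embeddings (in practice these would come from a KG embedding model).
gene_vectors = {
    "GENE_A": [1.0, 0.0],
    "GENE_B": [0.0, 1.0],
    "GENE_C": [1.0, 1.0],
}

# Two patients from datasets that measure different gene sets.
patient_1 = {"GENE_A": 3.0, "GENE_B": 1.0}
patient_2 = {"GENE_B": 2.0, "GENE_C": 2.0, "GENE_D": 5.0}  # GENE_D has no embedding

emb_1 = patient_embedding(patient_1, gene_vectors)
emb_2 = patient_embedding(patient_2, gene_vectors)
```

Because both patients end up in the same embedding space even though their datasets measure different genes, the resulting vectors can be passed to any standard binary classifier. This is the property that makes embedding-based integration attractive for heterogeneous expression data.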

Paper Structure (11 sections, 4 figures, 3 tables)

Motivation
Related Work
Methodology
Expression Data
Building the Knowledge Graph
Learning Patient Representations
Predicting Diabetes
Evaluation
Data
Results and Discussion
Conclusion

Figures (4)

Figure 1: Overview of the proposed methodology with the main steps: building the KG, learning patient representations and predicting diabetes.
Figure 2: Schema of the two types of data sources and how they are integrated into the KG.
Figure 3: Experimental strategy to split the GSE30208 dataset and enrich with data from the GSE15932 and GSE55098 datasets.
Figure 4: Performance comparison between using a KG with domain knowledge and without domain knowledge generated with two approaches: binning and patient-gene links. Acc stands for accuracy, Pr stands for precision, Re stands for recall, F1 stands for f-measure, WAF stands for weighted average f-measure, and AUC stands for area under the ROC curve.
