Integrating Heterogeneous Gene Expression Data through Knowledge Graphs for Improving Diabetes Prediction
Rita T. Sousa, Heiko Paulheim
TL;DR
The paper tackles diabetes prediction from gene expression data, addressing small sample sizes and cross-study heterogeneity with a knowledge graph (KG) framework. It constructs a biomedical KG that fuses multiple expression datasets with domain knowledge (GO, GOA, STRING) and learns patient representations via RDF2Vec embeddings, which are then used for binary diabetes classification. Empirical results show that KG-based integration improves predictive performance across multiple metrics, with the largest gains coming from representing patients through weighted gene embeddings and from leveraging domain knowledge; naïve data merging can introduce noise. The approach is extensible to other diseases, and future work could apply graph neural networks that operate directly on the KG for prediction.
Abstract
Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of diverse data types, including gene expression data. While gene expression data can provide valuable insights, challenges arise because sample sizes in expression datasets are usually limited and because data from different datasets, which measure the expression of different genes, cannot be easily combined. This work proposes a novel approach to address these challenges by integrating multiple gene expression datasets and domain-specific knowledge using knowledge graphs (KGs), a unique tool for biomedical data integration. KG embedding methods are then employed to generate vector representations, which serve as inputs for a classifier. Experiments demonstrate the efficacy of our approach, revealing improvements in diabetes prediction when integrating multiple gene expression datasets and domain-specific knowledge about protein functions and interactions.
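To make the step from gene embeddings to classifier inputs concrete, the following is a minimal sketch of one way a patient vector could be built as an expression-weighted average of gene embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `patient_vector`, the gene symbols, the expression values, and the random stand-in embeddings (used here in place of RDF2Vec output) are all hypothetical, and the exact weighting scheme used in the paper is not specified in this section.

```python
import numpy as np


def patient_vector(expression, gene_embeddings):
    """Expression-weighted average of gene embeddings for one patient.

    Assumption: expression values are non-negative and act as weights.
    Genes without an embedding are skipped.
    """
    genes = [g for g in expression if g in gene_embeddings]
    if not genes:
        raise ValueError("no overlap between measured genes and embedded genes")
    weights = np.array([expression[g] for g in genes], dtype=float)
    vectors = np.stack([gene_embeddings[g] for g in genes])
    total = weights.sum()
    if total == 0:
        # All weights are zero: fall back to an unweighted mean.
        return vectors.mean(axis=0)
    return weights @ vectors / total


# Stand-in embeddings: random vectors, NOT real RDF2Vec output.
rng = np.random.default_rng(0)
gene_embeddings = {g: rng.normal(size=4) for g in ["INS", "GCK", "TCF7L2"]}

# Two hypothetical patients from datasets that measure different genes.
patient_a = {"INS": 2.0, "GCK": 1.0}
patient_b = {"GCK": 3.0, "TCF7L2": 1.0, "XYZ": 5.0}  # "XYZ" has no embedding

# Feature matrix with one row per patient, ready for any binary classifier.
X = np.stack([patient_vector(p, gene_embeddings) for p in (patient_a, patient_b)])
```

Because every patient is mapped into the same embedding space regardless of which genes a dataset measured, rows from different studies can be stacked into a single feature matrix, which is the property that makes cross-dataset training possible in this sketch.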
