Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction

Yasir Ghunaim, Robert Hoehndorf

TL;DR

This work expands upon KANO by integrating the large-scale ChEBI knowledge graph, which includes 2,840 functional groups -- significantly more than the original 82 used in KANO -- and demonstrates that including ChEBI leads to improved performance on 9 out of 14 molecular property prediction datasets.

Abstract

Pre-training machine learning models on molecular properties has proven effective for generating robust and generalizable representations, which is critical for advancements in drug discovery and materials science. While recent work has primarily focused on data-driven approaches, the KANO model introduces a novel paradigm by incorporating knowledge-enhanced pre-training. In this work, we expand upon KANO by integrating the large-scale ChEBI knowledge graph, which includes 2,840 functional groups -- significantly more than the original 82 used in KANO. We explore two approaches, Replace and Integrate, to incorporate this extensive knowledge into the KANO framework. Our results demonstrate that including ChEBI leads to improved performance on 9 out of 14 molecular property prediction datasets. This highlights the importance of utilizing a larger and more diverse set of functional groups to enhance molecular representations for property predictions. Code: github.com/Yasir-Ghunaim/KANO-ChEBI

Paper Structure

This paper contains 8 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: (a) Original ElementKG structure utilized by KANO, in which elements and functional groups are derived from the periodic table and Wikipedia. (b) Enhancing ElementKG with ChEBI. Our methodology involves extracting functional groups from ChEBI and incorporating them into ElementKG using either the Replace or the Integrate operation. The Replace operation removes the FunctionalGroup subgraph and substitutes ChEBI groups for it, while the Integrate operation adds the new ChEBI groups without removing existing data. Finally, we define relations between ChEBI groups and entities in the Element subgraph of ElementKG.
  • Figure 2: Histograms comparing functional group matches for KANO using ElementKG versus ChEBI. The histograms display the distribution of functional group matches in different datasets (ClinTox, BACE, MUV, HIV) when using ElementKG and ChEBI. Notably, ChEBI's larger functional group set results in a significantly higher number of matches across all datasets, illustrating its potential for more detailed molecular characterization.
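
The difference between the Replace and Integrate operations described in Figure 1 can be illustrated with a minimal sketch that treats a knowledge graph as a set of (subject, relation, object) triples. This is not the authors' implementation: the function names, the relation names (`isA`, `hasElement`), and the toy entities, including the `ChEBI:`-prefixed placeholders, are assumptions made purely for illustration and are not real ChEBI identifiers.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def replace_groups(kg: Set[Triple],
                   chebi_triples: Set[Triple],
                   original_groups: Set[str]) -> Set[Triple]:
    """Replace: drop every triple that touches an original functional
    group entity, then add the ChEBI-derived triples."""
    kept = {
        (s, r, o) for (s, r, o) in kg
        if s not in original_groups and o not in original_groups
    }
    return kept | chebi_triples


def integrate_groups(kg: Set[Triple],
                     chebi_triples: Set[Triple]) -> Set[Triple]:
    """Integrate: add the ChEBI-derived triples, keep everything else."""
    return kg | chebi_triples


# Toy stand-in for ElementKG (hypothetical entities and relations).
element_kg: Set[Triple] = {
    ("C", "isA", "Element"),
    ("O", "isA", "Element"),
    ("Hydroxyl", "isA", "FunctionalGroup"),
    ("Hydroxyl", "hasElement", "O"),
}
original_groups = {"Hydroxyl"}

# Toy stand-in for groups extracted from ChEBI, already linked to
# entities in the Element subgraph (placeholder names, not real IDs).
chebi_triples: Set[Triple] = {
    ("ChEBI:carboxy_group", "isA", "FunctionalGroup"),
    ("ChEBI:carboxy_group", "hasElement", "C"),
    ("ChEBI:carboxy_group", "hasElement", "O"),
}

replaced = replace_groups(element_kg, chebi_triples, original_groups)
integrated = integrate_groups(element_kg, chebi_triples)

print(len(replaced), len(integrated))
```

In this toy setting, Replace leaves the Element triples intact and discards only the triples involving the original functional groups, whereas Integrate yields the union of both graphs. The final step in the caption, linking ChEBI groups to Element entities, is represented here by the assumed `hasElement` triples.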