Knowledge-augmented Graph Machine Learning for Drug Discovery: A Survey

Zhiqiang Zhong, Anastasia Barkova, Davide Mottin

TL;DR

This survey defines Knowledge-augmented Graph Machine Learning (KaGML) for drug discovery, arguing that combining graph-based models with external biomedical knowledge yields more accurate and interpretable predictions, especially when training data is limited. It introduces a four-way taxonomy (preprocessing, pretraining, training, interpretability) for incorporating knowledge into GML and surveys a broad set of methods and resources, including knowledge graphs and domain databases. The paper also highlights practical biomedical resources, knowledge-graph architectures, and typical drug discovery tasks that benefit from KaGML, and discusses remaining challenges such as data harmonisation, uncertainty, and scalability. Overall, KaGML is positioned as a promising framework for fusing structured biomedical knowledge with graph ML to accelerate precise, explainable drug discovery.

Abstract

The integration of Artificial Intelligence (AI) into the field of drug discovery has been a growing area of interdisciplinary scientific research. However, conventional AI models are heavily limited in handling complex biomedical structures (such as 2D or 3D protein and molecule structures) and in providing interpretations for their outputs, which hinders their practical application. Recently, Graph Machine Learning (GML) has gained considerable attention for its exceptional ability to model graph-structured biomedical data and investigate their properties and functional relationships. Despite extensive efforts, GML methods still suffer from several deficiencies, such as a limited ability to handle supervision sparsity, a lack of interpretability in learning and inference processes, and ineffectiveness in utilising relevant domain knowledge. In response, recent studies have proposed integrating external biomedical knowledge into the GML pipeline to realise more precise and interpretable drug discovery with limited training instances. However, a systematic definition for this burgeoning research direction is yet to be established. This survey presents a comprehensive overview of long-standing drug discovery principles, provides the foundational concepts and cutting-edge techniques for graph-structured data and knowledge databases, and formally summarises Knowledge-augmented Graph Machine Learning (KaGML) for drug discovery. We present a thorough review of related KaGML works, collected following a carefully designed search methodology, and organise them into four categories following a newly defined taxonomy. To facilitate research in this rapidly emerging field, we also share collected practical resources that are valuable for intelligent drug discovery and provide an in-depth discussion of the potential avenues for future advancements.

Paper Structure (20 sections, 3 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Illustration of real-world biomedical data in the form of graphs (a) and examples of human biomedical knowledge (b).
  • Figure 2: Toy examples of a graph and typical graph representation learning (GRL) approaches. (a) A graph can be represented by a node attribute matrix $\mathbf{X}$ and an adjacency matrix $\mathbf{A}$. (b) Graph representation learning converts a graph into a set of vectors that record information about the graph. (c) A toy example of random walk-based shallow GRL approaches. (d) A toy example of the GNN mechanism.
  • Figure 3: Illustration of example knowledge databases. (a) Metagraph of the Bioteque [TFBLA22], showing all the entities and the most representative associations between them. (b) An example graph of polypharmacy side effects derived from genomic and patient population data [ZAL18]. More explanations of the biomedical terms are presented in Section \ref{sec:dd}.
  • Figure 4: Illustration of the four paradigms in intelligent drug discovery.
  • Figure 5: Overview of various intelligent drug discovery approaches. (a) GML for drug discovery. Biomedical structured data is fed into a GML model to predict properties of, or gain insights about, the input entities. (b) KG for drug discovery. A KG database is mapped into an embedding space, enabling further analysis to identify missing relationships between entities. (c) KaGML for drug discovery. Human domain knowledge is integrated into GML models to enhance the flexibility, precision, and interpretability of the drug discovery process.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Definition 1: Graph
  • Definition 2: $\lambda$-hop Neighbourhood and Subgraph
  • Definition 3: Graph Representation Learning
  • Definition 4: Graph Machine Learning Training
  • Definition 5: Knowledge Database
  • Definition 6: Knowledge Graph
  • Definition 7: Knowledge Graph Representation Learning
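
As a rough illustration of the objects named in the Figure 2 caption and in Definitions 1 and 2, the sketch below builds a toy graph from a node attribute matrix and an adjacency matrix, collects a multi-hop neighbourhood, and runs one neighbour-averaging step of the kind a GNN layer builds on. The survey's formal definitions are not reproduced on this page, so everything here is an assumption: the toy data, the function names `hop_neighbourhood` and `mean_aggregate`, the choice to exclude the source node from its own neighbourhood, and the parameter-free mean aggregation are all hypothetical and are not the authors' formulation.

```python
from collections import deque

# Toy undirected path graph with 4 nodes: edges 0-1, 1-2, 2-3 (illustrative data).
A = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

# Node attribute matrix X: one row per node, two made-up features per node.
X = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
]


def hop_neighbourhood(adj, source, hops):
    """Return the nodes reachable from `source` in at most `hops` steps.

    Assumption: the source node itself is excluded from the result.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == hops:
            continue  # do not expand beyond the hop limit
        for v, connected in enumerate(adj[u]):
            if connected and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return {v for v in dist if v != source}


def mean_aggregate(adj, features):
    """One message-passing step without learned weights.

    Each node's new vector is the mean of its own features and those of
    its direct neighbours.
    """
    updated = []
    for u, row in enumerate(adj):
        members = [u] + [v for v, connected in enumerate(row) if connected]
        dim = len(features[u])
        updated.append(
            [sum(features[m][d] for m in members) / len(members) for d in range(dim)]
        )
    return updated


if __name__ == "__main__":
    print(hop_neighbourhood(A, 0, 2))
    print(mean_aggregate(A, X))
```

A trained GNN would typically apply learned weight matrices and a non-linearity around such an aggregation step and stack several layers, so that each node's representation reflects a progressively larger neighbourhood.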