Investigating Graph Neural Networks and Classical Feature-Extraction Techniques in Activity-Cliff and Molecular Property Prediction

Markus Dablander

TL;DR

This work systematically explores and further develops classical and graph-based molecular featurisation methods for two important tasks: molecular property prediction, in particular quantitative structure-activity relationship (QSAR) prediction, and the largely unexplored challenge of activity-cliff (AC) prediction.

Abstract

Molecular featurisation refers to the transformation of molecular data into numerical feature vectors. It is one of the key research areas in molecular machine learning and computational drug discovery. Recently, message-passing graph neural networks (GNNs) have emerged as a novel method to learn differentiable features directly from molecular graphs. While such techniques hold great promise, further investigations are needed to clarify if and when they indeed manage to definitively outcompete classical molecular featurisations such as extended-connectivity fingerprints (ECFPs) and physicochemical-descriptor vectors (PDVs). We systematically explore and further develop classical and graph-based molecular featurisation methods for two important tasks: molecular property prediction, in particular, quantitative structure-activity relationship (QSAR) prediction, and the largely unexplored challenge of activity-cliff (AC) prediction. We first give a technical description and critical analysis of PDVs, ECFPs and message-passing GNNs, with a focus on graph isomorphism networks (GINs). We then conduct a rigorous computational study to compare the performance of PDVs, ECFPs and GINs for QSAR and AC-prediction. Following this, we mathematically describe and computationally evaluate a novel twin neural network model for AC-prediction. We further introduce an operation called substructure pooling for the vectorisation of structural fingerprints as a natural counterpart to graph pooling in GNN architectures. We go on to propose Sort & Slice, a simple substructure-pooling technique for ECFPs that robustly outperforms hash-based folding at molecular property prediction. Finally, we outline two ideas for future research: (i) a graph-based self-supervised learning strategy to make classical molecular featurisations trainable, and (ii) trainable substructure-pooling via differentiable self-attention.
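To make the contrast between hash-based folding and Sort & Slice concrete, the following is a minimal sketch rather than the thesis's implementation. It assumes each molecule has already been reduced to a set of integer substructure identifiers (the identifiers below are made up), and it assumes that Sort & Slice ranks substructures by the number of training molecules that contain them and keeps only the most frequent ones. The function names and the tie-breaking rule are hypothetical.

```python
from collections import Counter

def fold_by_hashing(substructure_ids, length):
    """Hash-based folding: identifier i sets position i % length (collisions possible)."""
    vector = [0] * length
    for ident in substructure_ids:
        vector[ident % length] = 1
    return vector

def fit_sort_and_slice(training_sets, length):
    """Rank identifiers by how many training molecules contain them; keep the top `length`.

    Ties are broken by the identifier value (an assumption made for determinism).
    """
    counts = Counter(ident for ids in training_sets for ident in set(ids))
    ranked = sorted(counts, key=lambda ident: (-counts[ident], ident))
    return {ident: position for position, ident in enumerate(ranked[:length])}

def sort_and_slice(substructure_ids, vocabulary, length):
    """Set one position per retained substructure; unseen or sliced-away ones are ignored."""
    vector = [0] * length
    for ident in substructure_ids:
        position = vocabulary.get(ident)
        if position is not None:
            vector[position] = 1
    return vector

# Hypothetical training set: three molecules given as sets of substructure identifiers.
train = [{11, 27, 35}, {11, 27, 43}, {11, 59}]
vocab = fit_sort_and_slice(train, length=4)

folded = fold_by_hashing({11, 27}, length=4)        # 11 and 27 collide at position 3
pooled = sort_and_slice({11, 27}, vocab, length=4)  # each gets its own position
```

In this toy case the two identifiers 11 and 27 land on the same folded position, whereas the frequency-ranked vocabulary assigns them distinct positions, which illustrates why avoiding hash collisions can matter for downstream prediction.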

Paper Structure

This paper contains 71 sections, 7 theorems, 223 equations, 27 figures, 8 tables.

Key Result

Proposition 2.1

The GCN model falls into the class of message-passing GNNs, i.e. the atom-feature-vector updating process of GCNs can be mathematically expressed via the message-passing scheme described in the message-passing equations (eq: mpnn_equations).
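The claim of Proposition 2.1 can be illustrated numerically. The sketch below is not taken from the thesis: it assumes the standard GCN layer $H' = \mathrm{ReLU}(\tilde{D}^{-1/2}(A + I)\tilde{D}^{-1/2} H W)$ and a message-passing decomposition in which atom $j$ sends $h_j / \sqrt{\tilde{d}_i \tilde{d}_j}$ to atom $i$, messages are summed, and the update is a linear map followed by ReLU. The thesis's own notation in its message-passing equations may differ, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_matrix_form(adjacency, features, weight):
    """GCN layer written globally: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    a_tilde = [[adjacency[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_tilde]
    a_hat = [[a_tilde[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
             for i in range(n)]
    return [[max(0.0, x) for x in row] for row in matmul(a_hat, matmul(features, weight))]

def gcn_message_passing_form(adjacency, features, weight):
    """Same layer written locally as message, sum-aggregation and update steps."""
    n = len(adjacency)
    # Each atom's neighbourhood includes itself (the self-loop added by A + I).
    neighbours = [[j for j in range(n) if adjacency[i][j] or j == i] for i in range(n)]
    degree = [len(nb) for nb in neighbours]
    updated = []
    for i in range(n):
        aggregated = [0.0] * len(features[0])
        for j in neighbours[i]:
            scale = 1.0 / math.sqrt(degree[i] * degree[j])  # message normalisation
            for k, value in enumerate(features[j]):
                aggregated[k] += scale * value              # sum aggregation
        row = matmul([aggregated], weight)[0]               # linear update
        updated.append([max(0.0, x) for x in row])          # ReLU non-linearity
    return updated

# Toy molecular graph: a three-atom chain with two features per atom.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[0.5, -1.0], [1.0, 0.25]]

dense = gcn_matrix_form(adjacency, features, weight)
local = gcn_message_passing_form(adjacency, features, weight)
agree = all(math.isclose(x, y, abs_tol=1e-12)
            for row_d, row_l in zip(dense, local) for x, y in zip(row_d, row_l))
```

Both functions assume an unweighted, symmetric 0/1 adjacency matrix without self-loops; under that assumption the two forms agree up to floating-point rounding.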

Figures (27)

  • Figure 1: Generation of a Simplified Molecular-Input Line-Entry System (SMILES) string from the molecular graph of the antibiotic molecule Ciprofloxacin. First, the molecular graph is reduced to its hydrogen-depleted version. Then, cycles are broken to turn the graph into a spanning tree. Finally, a depth-first traversal of the spanning tree (here starting with the leftmost nitrogen atom as the root) produces the SMILES string, in which branches are specified via parentheses. The integers in the SMILES string indicate which ring bonds were broken to produce the spanning tree, and the equality signs indicate double bonds. Image source: smileswikifigure2023.
  • Figure 2: Circular subgraphs of varying radii for a central nitrogen atom in an example molecule.
  • Figure 3: Schematic overview of the molecular-featurisation mechanism of a message-passing graph neural network (GNN) with radius $R = 2$. All depicted functions may contain trainable deep-learning components.
  • Figure 4: Example of two non-isomorphic graphs that cannot be distinguished by the $1$-WL test if all nodes are assumed to have identical initial colourings. Image source: bouritsas2022improving.
  • Figure 5: Example of an activity cliff (AC) for blood coagulation factor Xa. A small structural change in the upper compound leads to an increase in binding affinity of almost three orders of magnitude. Here, binding affinity is quantified via the commonly used pKi value, which represents the negative decadic logarithm of the dissociation constant Ki of the drug-target complex. Both compounds can be found in the same ChEMBL assay with ID 658338.
  • ...and 22 more figures
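The limitation described in the caption of Figure 4 can be reproduced with a small colour-refinement sketch. The graphs shown in the figure itself are not available here, so this example assumes a well-known pair instead: a six-membered ring and two disjoint triangles. The function name and the use of nested tuples as colours are illustrative choices, not the thesis's formulation.

```python
from collections import Counter

def wl_colour_histogram(adjacency_list, rounds=3):
    """Run 1-WL colour refinement and return the histogram of final node colours.

    All nodes start with the same colour. In each round a node's new colour is the
    pair (own colour, sorted multiset of neighbour colours). Nested tuples are used
    as colours so that histograms are directly comparable across graphs.
    """
    colours = {node: () for node in adjacency_list}
    for _ in range(rounds):
        colours = {
            node: (colours[node], tuple(sorted(colours[nb] for nb in adjacency_list[node])))
            for node in adjacency_list
        }
    return Counter(colours.values())

# Two non-isomorphic graphs in which every node has exactly two neighbours.
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

# A six-node chain, included as a control that 1-WL does tell apart from the ring.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

indistinguishable = wl_colour_histogram(hexagon) == wl_colour_histogram(two_triangles)
distinguishable = wl_colour_histogram(hexagon) != wl_colour_histogram(chain)
```

Because every node in the ring and in the triangles sees the same multiset of neighbour colours in every round, the refinement never separates the two graphs, whereas the chain's end nodes differ from its interior nodes after the first round.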

Theorems & Definitions (29)

  • Definition 2.1: Molecular Featurisation
  • Definition 2.2: Molecular Graph
  • Definition 2.3: Atom and Bond Feature Vectors
  • Definition 2.4: Physicochemical-Descriptor Vector
  • Example 2.1: Predicted $\log(P)$
  • Example 2.2: Balaban Index
  • Definition 2.5: Circular Subgraph
  • Proposition 2.1: Message-Passing for GCNs
  • proof
  • Theorem 2.1: GNN-Conditions for $1$-WL Power
  • ...and 19 more