
VN-EGNN: E(3)-Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification

Florian Sestak, Lisa Schneckenreiter, Johannes Brandstetter, Sepp Hochreiter, Andreas Mayr, Günter Klambauer

TL;DR

This work tackles binding site identification in proteins using 3D structure data. It introduces VN-EGNN, an $E(3)$-equivariant graph neural network that augments EGNNs with $K$ virtual nodes connected to all physical nodes and a three-phase message passing scheme to learn hidden geometric entities like binding-site centers. The model optimizes a joint objective combining binding-site center loss $L_{\mathrm{bsc}}$ and segmentation loss $L_{\mathrm{dice}}$, and employs a self-confidence module, yielding state-of-the-art performance on COACH420, HOLO4K, and PDBbind2020. By predicting binding-site centers directly and leveraging virtual-node representations, VN-EGNN provides accurate, interpretable proposals for ligand binding regions while maintaining computational efficiency on residue-level graphs; this advances structure-based drug design and could scale to large structure databases such as AlphaFold predictions.
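
The joint objective can be illustrated with a minimal sketch. The soft Dice loss below is the standard formulation. The form of the center loss (squared distance from the closest virtual node to the known center) and the `weight` parameter are assumptions made for illustration, since this summary does not spell out $L_{\mathrm{bsc}}$. All function names are hypothetical.

```python
def dice_loss(probs, labels, eps=1e-6):
    """Soft Dice loss: 1 - 2 * overlap / (sum of predictions + sum of labels)."""
    overlap = sum(p * y for p, y in zip(probs, labels))
    total = sum(probs) + sum(labels)
    return 1.0 - (2.0 * overlap + eps) / (total + eps)


def center_loss(virtual_positions, true_center):
    """Assumed form: squared distance from the closest virtual node to the known center."""
    return min(
        sum((a - b) ** 2 for a, b in zip(v, true_center))
        for v in virtual_positions
    )


def joint_loss(probs, labels, virtual_positions, true_center, weight=1.0):
    """Center loss plus weighted segmentation loss (the weighting is an assumption)."""
    return center_loss(virtual_positions, true_center) + weight * dice_loss(probs, labels)


# Per-residue binding probabilities, binary labels, and two virtual node positions.
probs = [0.9, 0.1, 0.8]
labels = [1, 0, 1]
virtual_positions = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
true_center = (3.0, 4.0, 1.0)
loss = joint_loss(probs, labels, virtual_positions, true_center)
```

A perfect segmentation with a virtual node placed exactly on the known center drives both terms to zero, which is the behavior the combined objective rewards.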

Abstract

Being able to identify regions within or around proteins, to which ligands can potentially bind, is an essential step to develop new drugs. Binding site identification methods can now profit from the availability of large amounts of 3D structures in protein structure databases or from AlphaFold predictions. Current binding site identification methods heavily rely on graph neural networks (GNNs), usually designed to output E(3)-equivariant predictions. Such methods turned out to be very beneficial for physics-related tasks like binding energy or motion trajectory prediction. However, the performance of GNNs at binding site identification is still limited potentially due to the lack of dedicated nodes that model hidden geometric entities, such as binding pockets. In this work, we extend E(n)-Equivariant Graph Neural Networks (EGNNs) by adding virtual nodes and applying an extended message passing scheme. The virtual nodes in these graphs are dedicated quantities to learn representations of binding sites, which leads to improved predictive performance. In our experiments, we show that our proposed method VN-EGNN sets a new state-of-the-art at locating binding site centers on COACH420, HOLO4K and PDBbind2020.

Paper Structure

This paper contains 35 sections, 3 theorems, 26 equations, 7 figures, and 6 tables.

Key Result

Proposition 1

$E(3)$-equivariant graph neural networks with virtual nodes, as defined in Eqs. (eq:message1) through (eq:update3), are equivariant with respect to roto-translations and reflections of the input and virtual node coordinates.
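
The proposition can be checked numerically on a toy layer. The sketch below is not the authors' implementation: it assumes the three phases are physical-to-physical, physical-to-virtual, and virtual-to-physical, uses scalar node features, a fully connected graph, and fixed `tanh` expressions as stand-ins for the learned networks. All names are hypothetical. The property it relies on is that messages depend only on invariant quantities (features and squared distances) while coordinates move along relative vectors.

```python
import math


def egnn_phase(h_recv, x_recv, h_send, x_send, skip_self=False):
    """One EGNN-style phase: each receiver aggregates messages from all senders."""
    new_h, new_x = [], []
    for i, (hi, xi) in enumerate(zip(h_recv, x_recv)):
        agg, shift, n = 0.0, [0.0, 0.0, 0.0], 0
        for j, (hj, xj) in enumerate(zip(h_send, x_send)):
            if skip_self and i == j:
                continue
            d2 = sum((p - q) ** 2 for p, q in zip(xi, xj))
            m = math.tanh(hi + hj) / (1.0 + d2)  # stand-in for the learned message network
            w = math.tanh(m)                     # stand-in for the learned coordinate weight
            agg += m
            for k in range(3):
                shift[k] += (xi[k] - xj[k]) * w
            n += 1
        n = max(n, 1)
        new_h.append(hi + math.tanh(agg))        # residual feature update
        new_x.append([xi[k] + shift[k] / n for k in range(3)])
    return new_h, new_x


def vn_egnn_layer(h_atoms, x_atoms, h_virt, x_virt):
    """Assumed three-phase scheme with virtual nodes connected to all physical nodes."""
    h_atoms, x_atoms = egnn_phase(h_atoms, x_atoms, h_atoms, x_atoms, skip_self=True)
    h_virt, x_virt = egnn_phase(h_virt, x_virt, h_atoms, x_atoms)
    h_atoms, x_atoms = egnn_phase(h_atoms, x_atoms, h_virt, x_virt)
    return h_atoms, x_atoms, h_virt, x_virt


def transform(points, angle=0.7, shift=(1.0, -2.0, 0.5)):
    """Reflect x, rotate about the z-axis, then translate: an element of E(3)."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for x, y, z in points:
        x = -x
        out.append([c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2]])
    return out


def max_abs_diff(a, b):
    """Largest coordinate-wise deviation between two point lists."""
    return max(abs(p - q) for u, v in zip(a, b) for p, q in zip(u, v))


h_atoms = [0.1, 0.2, -0.3, 0.4]
x_atoms = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0], [0.0, 0.0, 1.5]]
h_virt = [0.0, 0.0]
x_virt = [[2.0, 2.0, 2.0], [-2.0, 1.0, 0.0]]

# Apply the layer first and transform afterwards ...
ha, xa, hv, xv = vn_egnn_layer(h_atoms, x_atoms, h_virt, x_virt)
# ... and compare with transforming first and applying the layer afterwards.
ha_t, xa_t, hv_t, xv_t = vn_egnn_layer(h_atoms, transform(x_atoms), h_virt, transform(x_virt))

atom_error = max_abs_diff(transform(xa), xa_t)
virtual_error = max_abs_diff(transform(xv), xv_t)
```

Both errors should vanish up to floating-point rounding, and the features `ha` and `hv` should be unchanged by the transformation, because only invariant quantities enter the feature updates.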

Figures (7)

  • Figure 1: Overview of binding site identification methods. Top Left: Traditional methods, based on segmentation of a voxel grid, in which the pocket center is calculated as the geometric center of the positively labeled voxels. Bottom Left: Geometric Deep Learning approaches, such as EGNN, in which the pocket center is calculated as the geometric center of the positively labeled nodes. Right: VN-EGNN approach (ours): the predicted binding site center is the position of the virtual node after $L$ message passing layers.
  • Figure 2: Left: Example of a prediction from our model: Initial positions of the virtual nodes are represented by the yellow spheres around the protein, the ground truth binding site is indicated by the light violet ligand, whereas violet regions on the protein represent the annotated binding site. The arrows indicate how the positions of the virtual nodes change from their initial positions. The violet spheres represent the clustered virtual node predictions with their associated self-confidence score. To simplify the visualization, not all initial positions of the virtual nodes are depicted in the figure. Right: DCC values for different thresholds. The x-axis denotes different thresholds for the distance between the predicted and known binding pocket centers in Å. Predictions with a distance below the threshold are counted as correctly found binding pockets. The y-axis denotes the DCC success rate.
  • Figure 61: Validation curves of a VN-EGNN during development. Despite only being trained to minimize the segmentation loss, the virtual nodes converged towards the known binding sites. Left: DCC success rate during training. Middle: DCA success rate during training. Right: Segmentation loss during training.
  • Figure 81: Examples of Detected Binding Sites: Visualization and Analysis. We visualized two distinct proteins using PyMOL, where the initial positions of the virtual nodes are represented by yellow spheres, whereas the violet spheres indicate the virtual nodes after $L$ message passing steps. The violet molecules indicate the position of the ligand as in the original PDB file. The arrows connect the starting positions to the predicted positions of the virtual nodes. The visualization demonstrates that our model distributes the virtual nodes among various possible binding positions. The visualizations show the predicted positions after applying clustering as described in the implementation details section (subsec:impl_details). To simplify the visualization, not all initial positions of the virtual nodes are depicted in the figure. Left: 1odi. Right: 3lpk.
  • Figure 82: t-SNE embeddings of virtual node features of the best-ranked pockets for each protein in the PDBbind2020 dataset, colored by protein family according to ChEMBL. The eight largest protein classes are shown; the remaining proteins are colored in grey.
  • ...and 2 more figures
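
The captions of Figures 1 and 2 describe two computations that a short sketch can make concrete: the geometric center of positively labeled nodes, and the DCC success rate at a distance threshold. The code below is a simplified illustration, assuming one predicted center is already paired with each known pocket; the 4 Å default threshold and the function names are assumptions, and the ranking of multiple predicted pockets is not modeled.

```python
import math


def geometric_center(points):
    """Mean position of a set of 3D points, e.g. positively labeled nodes."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def dcc_success_rate(predicted_centers, true_centers, threshold=4.0):
    """Fraction of pockets whose predicted center lies within `threshold` Å of the known center."""
    hits = sum(
        1
        for pred, true in zip(predicted_centers, true_centers)
        if math.dist(pred, true) < threshold
    )
    return hits / len(true_centers)


# Center-to-center distances are 1.0 Å, 10.0 Å, and 3.5 Å.
predicted = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
known = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 1.0, 4.5)]

rate_at_4 = dcc_success_rate(predicted, known, threshold=4.0)
rate_at_2 = dcc_success_rate(predicted, known, threshold=2.0)
center = geometric_center([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)])
```

Sweeping `threshold` over a range of values and plotting the resulting rates gives a curve of the kind shown in the right panel of Figure 2.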

Theorems & Definitions (6)

  • Proposition 1
  • proof
  • Proposition 1
  • proof
  • Proposition 2
  • proof