Advanced atom-level representations for protein flexibility prediction utilizing graph neural networks

Sina Sarparast; Aldo Zaimi; Maximilian Ebert; Michael-Rock Goldsmith

TL;DR

This work proposes, for the first time, using graph neural networks (GNNs) to learn atom-level protein representations and predict B-factors from protein 3D structures. It demonstrates the potential of GNN-learned representations for protein flexibility prediction and related tasks.

Abstract

Protein dynamics play a crucial role in many biological processes and drug interactions. However, measuring and simulating protein dynamics is challenging and time-consuming. While machine learning holds promise in deciphering the determinants of protein dynamics from structural information, most existing methods for protein representation learning operate at the residue level, ignoring the finer details of atomic interactions. In this work, we propose for the first time to use graph neural networks (GNNs) to learn protein representations at the atomic level and predict B-factors from protein 3D structures. The B-factor reflects the displacement of atoms in proteins and can serve as a surrogate for protein flexibility. We compared different GNN architectures to assess their performance. The Meta-GNN model achieves a correlation coefficient of 0.71 on a large and diverse test set of over 4k proteins (17M atoms) from the Protein Data Bank (PDB), outperforming previous methods by a large margin. Our work demonstrates the potential of representations learned by GNNs for protein flexibility prediction and other related tasks.
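The evaluation described above compares predicted and experimental B-factors atom by atom. As a minimal sketch, assuming the correlation coefficient is the Pearson coefficient (as named in the Figure 3 caption) and using the mean absolute error (MAE) reported in the Figure 4 caption, the two metrics could be computed for one protein as follows. The function names and the toy B-factor values are hypothetical and are not taken from the paper.

```python
import math


def pearson_cc(pred, target):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_p = sum(pred) / len(pred)
    mean_t = sum(target) / len(target)
    # Covariance and variances are left unnormalized; the factors cancel in the ratio.
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_t)


def mean_absolute_error(pred, target):
    """Average absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical per-atom B-factors for a toy five-atom "protein".
target_b = [12.0, 15.5, 30.2, 22.1, 18.4]
pred_b = [13.1, 14.9, 27.8, 24.0, 17.5]

cc = pearson_cc(pred_b, target_b)
mae = mean_absolute_error(pred_b, target_b)
print(f"CC = {cc:.3f}, MAE = {mae:.2f}")
```

A dataset-level score could then be obtained by averaging the per-protein values, although the abstract does not state how the reported 0.71 is aggregated.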

Paper Structure (41 sections, 26 equations, 6 figures, 7 tables)

This paper contains 41 sections, 26 equations, 6 figures, 7 tables.

Introduction
Protein representation learning
Small molecule representation learning
Protein B-factor prediction
Methods
Data Description
Graph Representation
Node Features
Atom type
Atom relative location
Atom degree
Residue type
Edge Features
Covalent bond type
Distance
...and 26 more sections

Figures (6)

Figure 1: Protein 3D structure visualization (protein 1GA0 from the PDB shown here with its secondary structure components). The graph representation of the protein consists of nodes $v$ (atoms) and edges $e$ (covalent bonds between the atoms). The edge between two nodes $v_{i}$ and $v_{j}$ is defined as $e_{ij}$.
Figure 2: Overview of a general graph neural network (GNN) architecture for the task of node prediction. (a) Each block contains a learnable GNN layer that updates node and/or edge representations, an activation function, and a regularization step (e.g., dropout, batch normalization). Skip connections are also typically implemented between consecutive blocks to ease training and improve model performance. The regression head is typically a sequence of multi-layer perceptron (MLP) layers that map the last hidden node embeddings to the final node predictions. (b) A protein with input node and/or edge features (top) is fed into the network and passes through layers that learn hidden latent representations of its nodes and/or edges (middle) relevant to the training task. The output is a protein with node predictions such as the B-factor (bottom). (c) A general-purpose GNN layer (the one proposed by meta is shown here as an example) takes input matrices of node and/or edge embeddings $V$ and $E$ and generates updated embeddings $V'$ and/or $E'$. Depending on the choice of GNN layer, different mechanisms can be used to perform the aggregation and update operations.
Figure 3: Visualization of the Pearson correlation coefficient (CC) distributions of the proteins from the Kinase test set for each model. For each protein, the value reported is the average CC computed over the 3 runs performed for each model. The blue box represents the interquartile range, the line inside it marks the median, and the dots are outliers.
Figure 4: Visualization of the B-factor predictions obtained on proteins 6I5I (CC of 0.88 & MAE of 6.78), 3A7J (CC of 0.85 & MAE of 4.74) and 5Y5T (CC of 0.84 & MAE of 10.68) from the test set using the Meta model. For each protein, the target B-factor values (ground truth), the predicted B-factor values and the errors (computed as $\text{prediction} - \text{target}$) are projected onto the 3D structure. All values are scaled between -1 and 1. Arrows highlight a few differences that can be observed in outer regions.
Figure 5: Distributions of atomic B-factor targets (identified as true) and predictions (identified by their model names) from the trained models for a few proteins from the Kinase test dataset. For all plots, the y axis represents the B-factor values and the x axis represents the models/targets.
...and 1 more figures
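
The graph representation in the Figure 1 caption (atoms as nodes $v$, covalent bonds as edges $e_{ij}$) and the aggregate-and-update step in the Figure 2(c) caption can be illustrated with a small sketch. It is a deliberately simplified stand-in and not the paper's Meta layer: the update rule is a fixed average with no learnable weights and no edge embeddings, and the four-atom fragment, the one-hot features, and all function names are hypothetical.

```python
# Toy molecular graph: atoms are nodes, covalent bonds are undirected edges.
# The atoms, bonds, and features are illustrative, not taken from the paper.
atoms = ["N", "C", "C", "O"]        # nodes v_0 .. v_3
bonds = [(0, 1), (1, 2), (2, 3)]    # edges e_ij between bonded atoms
ELEMENTS = ["C", "N", "O"]          # vocabulary for one-hot atom-type features


def build_neighbors(num_nodes, edges):
    """Build an adjacency list from a list of undirected edges."""
    neighbors = [[] for _ in range(num_nodes)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
    return neighbors


def message_passing_step(node_feats, neighbors):
    """One round of message passing with fixed (non-learnable) weights.

    Aggregation: mean of the neighbors' feature vectors.
    Update: equal-weight average of a node's own features and the aggregate.
    """
    updated = []
    for i, feats in enumerate(node_feats):
        nbrs = neighbors[i]
        if nbrs:
            agg = [
                sum(node_feats[j][k] for j in nbrs) / len(nbrs)
                for k in range(len(feats))
            ]
        else:
            # Isolated node: nothing to aggregate.
            agg = [0.0] * len(feats)
        updated.append([0.5 * f + 0.5 * a for f, a in zip(feats, agg)])
    return updated


# Input node embeddings V: one-hot atom type per node.
V = [[1.0 if a == e else 0.0 for e in ELEMENTS] for a in atoms]
neighbors = build_neighbors(len(atoms), bonds)

# Updated node embeddings V' after one round.
V_prime = message_passing_step(V, neighbors)
for atom, row in zip(atoms, V_prime):
    print(atom, row)
```

After one round, each atom's embedding mixes in the types of its bonded neighbors. Stacking several such rounds widens the neighborhood each node can see, which is the role of the consecutive blocks in the Figure 2(a) caption. A trainable layer would replace the fixed average with learned transformations.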
