A Survey of Deep Learning Methods in Protein Bioinformatics and its Impact on Protein Design

Weihang Dai

TL;DR

The paper surveys how deep learning methods are applied across three core areas of protein bioinformatics: structural prediction, functional prediction, and protein design. It highlights milestones such as AlphaFold2's end-to-end structure prediction and surveys a range of architectures from Transformers to Graph Neural Networks, examining both sequence- and structure-based approaches. It discusses the progress, challenges, and opportunities in translating advances in structure and function prediction into robust, design-focused capabilities, including evaluation hurdles and the need for rapid experimental validation. Overall, the work frames DL as a key driver in protein science with significant potential for drug design, disease understanding, and bioengineering, while recognizing that protein design remains the most demanding of the three problem classes and ripe for future breakthroughs.

Abstract

Proteins are sequences of amino acids that serve as the basic building blocks of living organisms. Despite rapidly growing databases documenting structural and functional information for various protein sequences, our understanding of proteins remains limited because of the large possible sequence space and the complex inter- and intra-molecular forces. Deep learning, which is characterized by its ability to learn relevant features directly from large datasets, has demonstrated remarkable performance in fields such as computer vision and natural language processing. In recent years it has also been applied with great success to the data-rich domain of protein sequences, most notably with AlphaFold2's breakout performance in protein structure prediction. The performance improvements achieved by deep learning unlock new possibilities in the field of protein bioinformatics, including protein design, one of the most difficult but useful tasks. In this paper, we broadly categorize problems in protein bioinformatics into three main categories: 1) structural prediction, 2) functional prediction, and 3) protein design, and review the progress achieved by applying deep learning methodologies to each of them. We expand on the main challenges of the protein design problem and highlight how advances in structural and functional prediction have directly contributed to design tasks. Finally, we conclude by identifying important topics and future research directions.

Paper Structure (32 sections, 4 equations, 12 figures)

Introduction
Background
Key concepts of deep learning
Learning through gradient descent
Comparison with traditional modelling approaches
Common Deep Learning Architectures
Architectures for Geometric Deep Learning
Protein Bioinformatics
Protein composition and interaction
Key problems in protein bioinformatics
Structural prediction from sequence
Traditional methods and key ideas
Treating distance maps as a picture
Transformers and graphs
Other architectures
...and 17 more sections

Figures (12)

Figure 1: (a) Simplified representation of an MLP with n layers. Input and intermediate features are passed along the network through matrix operations and non-linear activation functions. (b) A simplified 3D representation of stochastic gradient descent. Weights are updated such that the final values correspond to a minimum on the loss surface. Figure taken from amini2018spatial
Figure 2: A simplified overview of the main classes of GNNs. GNNs all involve a convolutional operator, $\psi$, a pooling operator, $\bigoplus$, and a non-linear activation function, $\phi$. Variations between them can be classified depending on the weighting methodology for features from neighbouring nodes. Figure taken from bronstein2021geometric
Figure 3: A Euclidean patch covers the region that is enclosed by a circular projection from a 2D Euclidean plane tangent to the surface. A geodesic patch covers the 2D region that is within some distance from a central point when measured along the surface. The example shows differences between the two patches on a protein structure with a deep pocket in the binding site. In the bottom left, the Euclidean patch covers a large region away from the tangent point that is irrelevant to the binding site. The geodesic patch only covers the surface that is part of the binding site. Figure taken from gainza2020deciphering
Figure 4: Peptide bonds are formed between the carbon and nitrogen atoms in two amino acids, releasing water in the process. The backbone of a protein structure is composed of these carbon and nitrogen atoms and is typically described by the 3D configuration of $C_{\alpha}$ atoms.
Figure 5: Simplified examples of different kinds of protein binding. Surface-surface binding is the most common type. Figure taken from alberts_1970
...and 7 more figures
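
The three ingredients named in the Figure 2 caption (an operator $\psi$ applied to neighbouring node features, a pooling operator $\bigoplus$, and a non-linear activation $\phi$) can be illustrated with a minimal sketch. The code below is not taken from the paper: the function name `gnn_layer`, the choice of linear maps for $\psi$, sum pooling for $\bigoplus$, and ReLU for $\phi$ are all assumptions made for illustration, and the weights are fixed by hand rather than learned.

```python
def matvec(weight, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def gnn_layer(features, neighbours, weight_self, weight_neigh):
    """One illustrative GNN layer (hypothetical helper, not from the paper).

    features[i]   : feature vector of node i
    neighbours[i] : indices of the nodes adjacent to node i
    weight_self   : linear map applied to a node's own features
    weight_neigh  : linear map applied to each neighbour (plays the role of psi)
    """
    out_dim = len(weight_neigh)
    updated = []
    for i, x_i in enumerate(features):
        # psi: transform the features of every neighbouring node
        messages = [matvec(weight_neigh, features[j]) for j in neighbours[i]]
        # pooling: a permutation-invariant sum over the neighbourhood
        if messages:
            pooled = [sum(values) for values in zip(*messages)]
        else:
            pooled = [0.0] * out_dim
        own = matvec(weight_self, x_i)
        # phi: ReLU applied to the combined self and neighbourhood signal
        updated.append([max(0.0, a + b) for a, b in zip(own, pooled)])
    return updated


# Toy graph: node 0 is connected to nodes 1 and 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbours = [[1, 2], [0], [0]]
result = gnn_layer(features, neighbours, identity, identity)
print(result)  # [[2.0, 2.0], [1.0, 1.0], [2.0, 1.0]]
```

Because the pooling step is a sum, listing a node's neighbours in a different order leaves the output unchanged. Under this reading, the GNN classes in Figure 2 would differ mainly in how each neighbour's contribution is weighted before pooling.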
