A Survey of Deep Learning Methods in Protein Bioinformatics and its Impact on Protein Design
Weihang Dai
TL;DR
The paper surveys how deep learning methods are applied across three core areas of protein bioinformatics: structural prediction, functional prediction, and protein design. It highlights milestones such as AlphaFold2's end-to-end structure prediction and surveys a range of architectures from Transformers to Graph Neural Networks, examining both sequence- and structure-based approaches. It discusses the progress, challenges, and opportunities in translating advances in structure and function prediction into robust, design-focused capabilities, including evaluation hurdles and the need for rapid experimental validation. Overall, the work frames DL as a key driver in protein science with significant potential for drug design, disease understanding, and bioengineering, while recognizing that protein design remains the most demanding of the three problem classes and ripe for future breakthroughs.
Abstract
Proteins are sequences of amino acids that serve as the basic building blocks of living organisms. Despite rapidly growing databases documenting structural and functional information for various protein sequences, our understanding of proteins remains limited because of the vast space of possible sequences and the complex inter- and intra-molecular forces involved. Deep learning, which is characterized by its ability to learn relevant features directly from large datasets, has demonstrated remarkable performance in fields such as computer vision and natural language processing. In recent years it has also been applied with great success to the data-rich domain of protein sequences, most notably with AlphaFold2's breakout performance in protein structure prediction. The performance improvements achieved by deep learning unlock new possibilities in the field of protein bioinformatics, including protein design, one of the most difficult but useful tasks. In this paper, we broadly categorize problems in protein bioinformatics into three main categories: 1) structural prediction, 2) functional prediction, and 3) protein design, and review the progress that deep learning methods have achieved in each of them. We expand on the main challenges of the protein design problem and highlight how advances in structural and functional prediction have directly contributed to design tasks. Finally, we conclude by identifying important topics and future research directions.
