Generative AI for Controllable Protein Sequence Design: A Survey

Yiheng Zhu, Zitai Kong, Jialu Wu, Weize Liu, Yuqiang Han, Mingze Yin, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou

TL;DR

This survey addresses the challenge of designing targeted protein sequences by leveraging generative AI and optimization to navigate the vast sequence space. It reviews distribution learning in local and global sequence spaces, emphasizing protein language models and diffusion models, and details structure-to-sequence and function-to-sequence design tasks. Key contributions include a taxonomy of controllable design tasks, a synthesis of autoregressive, one-shot, and iterative refinement approaches, and discussion of benchmarks and datasets. The review also identifies open challenges—data scarcity, interpretability, and evaluation—and outlines opportunities such as enhanced benchmarks and RL-guided PLMs for practical, controllable design.

Abstract

The design of novel protein sequences with targeted functionalities underpins a central theme in protein engineering, impacting diverse fields such as drug discovery and enzymatic engineering. However, navigating this vast combinatorial search space remains a severe challenge due to time and financial constraints. This scenario is rapidly evolving as the transformative advancements in AI, particularly in the realm of generative models and optimization algorithms, have been propelling the protein design field towards an unprecedented revolution. In this survey, we systematically review recent advances in generative AI for controllable protein sequence design. To set the stage, we first outline the foundational tasks in protein sequence design in terms of the constraints involved and present key generative models and optimization algorithms. We then offer in-depth reviews of each design task and discuss the pertinent applications. Finally, we identify the unresolved challenges and highlight research opportunities that merit deeper exploration.

Paper Structure

This paper contains 32 sections, 4 equations, 2 figures.

Figures (2)

  • Figure 1: Illustration of controllable protein sequence design. (a) The upper diagram delineates the four integral tiers of the protein central dogma. Amino acid sequences fold to form specific protein structures, which determine protein functions. These varied functions integrate to perform higher-level actions. Accordingly, protein sequence design tasks can be divided into structure-to-sequence, function-to-sequence, and higher-level sequence design. The lower diagram highlights that generative AI empowers designers to sidestep the intricate intermediate steps associated with traditional methods, facilitating protein design in an end-to-end manner. (b) Related generative AI technologies primarily comprise (i) deep generative models and (ii) optimization methods (see Section \ref{sec:preliminaries} for more details).
  • Figure 2: A taxonomy of protein sequence design methods with representative examples.