Generative AI for Controllable Protein Sequence Design: A Survey
Yiheng Zhu, Zitai Kong, Jialu Wu, Weize Liu, Yuqiang Han, Mingze Yin, Hongxia Xu, Chang-Yu Hsieh, Tingjun Hou
TL;DR
This survey addresses the challenge of designing protein sequences with targeted functionalities by using generative AI and optimization to navigate the vast sequence space. It reviews distribution learning in local and global sequence spaces, emphasizing protein language models (PLMs) and diffusion models, and details structure-to-sequence and function-to-sequence design tasks. Key contributions include a taxonomy of controllable design tasks, a synthesis of autoregressive, one-shot, and iterative refinement approaches, and a discussion of benchmarks and datasets. The review also identifies open challenges (data scarcity, interpretability, and evaluation) and outlines opportunities such as enhanced benchmarks and RL-guided PLMs for practical, controllable design.
Abstract
The design of novel protein sequences with targeted functionalities is a central theme in protein engineering, with impact on diverse fields such as drug discovery and enzyme engineering. However, navigating the vast combinatorial space of possible sequences remains a severe challenge under time and financial constraints. This situation is changing rapidly as advances in AI, particularly in generative models and optimization algorithms, propel the protein design field forward. In this survey, we systematically review recent advances in generative AI for controllable protein sequence design. To set the stage, we first outline the foundational tasks in protein sequence design in terms of the constraints involved and present key generative models and optimization algorithms. We then offer in-depth reviews of each design task and discuss the pertinent applications. Finally, we identify unresolved challenges and highlight research opportunities that merit deeper exploration.
