Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates
Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Wengong Jin, Yang Yang, Lei Li
TL;DR
EnzyGen is a unified generative framework for designing enzyme sequences and backbone structures across thousands of families, guided by automatically mined functionally important sites and small-molecule substrates. It combines interleaved attention and neighborhood equivariant layers (NAELs), which couple global sequence context with local 3D geometry, with a substrate representation pathway, and it jointly optimizes amino acid types, coordinates, and substrate-binding propensity. The EnzyBench dataset enables broad, family-spanning evaluation, on which EnzyGen achieves higher ESP scores, stronger substrate binding affinities, and robust folding confidence (pLDDT) compared with specialized baselines. Zero-shot tests and case studies demonstrate generalization to unseen functions and novel enzyme designs, underscoring the method's potential for rapid exploration of functional enzymes. Overall, EnzyGen advances enzyme design by unifying multiple design objectives in a single scalable framework.
Abstract
Enzymes are genetically encoded biocatalysts capable of accelerating chemical reactions. How can we automatically design functional enzymes? In this paper, we propose EnzyGen, an approach to learn a unified model to design enzymes across all functional families. Our key idea is to generate an enzyme's amino acid sequence and its three-dimensional (3D) coordinates based on functionally important sites and substrates corresponding to a desired catalytic function. These sites are automatically mined from enzyme databases. EnzyGen consists of a novel interleaving network of attention and neighborhood equivariant layers, which captures both long-range correlation in an entire protein sequence and local influence from nearest amino acids in 3D space. To learn the generative model, we devise a joint training objective, including a sequence generation loss, a position prediction loss and an enzyme-substrate interaction loss. We further construct EnzyBench, a dataset with 3157 enzyme families, covering all available enzymes within the Protein Data Bank (PDB). Experimental results show that EnzyGen consistently achieves the best performance across all 323 testing families, surpassing the best baseline by 10.79% in terms of substrate binding affinity. These findings demonstrate EnzyGen's superior capability in designing well-folded and effective enzymes binding to specific substrates with high affinities.
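The joint training objective named in the abstract can be illustrated with a minimal sketch. The abstract does not give the functional form of the three losses, so everything here is an assumption made for illustration: cross-entropy for the sequence term, mean squared error for the position term, binary cross-entropy for the enzyme-substrate interaction term, and the function names and weights `w_seq`, `w_pos`, `w_int` are hypothetical rather than taken from EnzyGen.

```python
import math

def sequence_loss(probs, targets):
    """Assumed form: mean negative log-likelihood of the true residue type."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def position_loss(pred, true):
    """Assumed form: mean squared distance between predicted and reference 3D coordinates."""
    sq_dists = [sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(pred, true)]
    return sum(sq_dists) / len(sq_dists)

def interaction_loss(bind_prob, binds):
    """Assumed form: binary cross-entropy on a predicted binding probability."""
    return -math.log(bind_prob) if binds else -math.log(1.0 - bind_prob)

def joint_loss(probs, targets, pred, true, bind_prob, binds,
               w_seq=1.0, w_pos=1.0, w_int=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_seq * sequence_loss(probs, targets)
            + w_pos * position_loss(pred, true)
            + w_int * interaction_loss(bind_prob, binds))

# Toy example: 2 residues, a 3-letter residue alphabet, one substrate.
probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]  # predicted residue-type distributions
targets = [0, 1]                                # true residue-type indices
pred = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]       # predicted coordinates
true = [[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]]       # reference coordinates
total = joint_loss(probs, targets, pred, true, bind_prob=0.5, binds=True)
```

In this toy setup the three terms are simply summed with equal weight; in practice the relative weighting would presumably be tuned so that no single term dominates training.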
