Generative Enzyme Design Guided by Functionally Important Sites and Small-Molecule Substrates

Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Wengong Jin, Yang Yang, Lei Li

TL;DR

EnzyGen presents a unified generative framework for designing enzyme sequence and backbone structure across thousands of families, guided by automatically mined functionally important sites and small-molecule substrates. By interleaving attention with neighborhood equivariant layers (NAELs), which couple global sequence context with local 3D geometry, and adding a substrate representation pathway, the approach jointly optimizes amino acid types, coordinates, and substrate-binding propensity. The EnzyBench dataset enables broad, family-spanning evaluation, on which EnzyGen achieves higher ESP scores, stronger substrate binding affinities, and robust folding (pLDDT) compared with specialized baselines. Zero-shot tests and case studies demonstrate generalization to unseen functions and novel enzyme designs, underscoring the method's potential for rapid functional enzyme exploration. Overall, EnzyGen advances enzyme design by unifying multiple design objectives in a single scalable framework.

Abstract

Enzymes are genetically encoded biocatalysts capable of accelerating chemical reactions. How can we automatically design functional enzymes? In this paper, we propose EnzyGen, an approach to learn a unified model to design enzymes across all functional families. Our key idea is to generate an enzyme's amino acid sequence and its three-dimensional (3D) coordinates based on functionally important sites and substrates corresponding to a desired catalytic function. These sites are automatically mined from enzyme databases. EnzyGen consists of a novel interleaving network of attention and neighborhood equivariant layers, which captures both long-range correlation across an entire protein sequence and local influence from the nearest amino acids in 3D space. To learn the generative model, we devise a joint training objective comprising a sequence generation loss, a position prediction loss, and an enzyme-substrate interaction loss. We further construct EnzyBench, a dataset with 3157 enzyme families, covering all available enzymes within the Protein Data Bank (PDB). Experimental results show that EnzyGen consistently achieves the best performance across all 323 testing families, surpassing the best baseline by 10.79% in terms of substrate binding affinity. These findings demonstrate EnzyGen's superior capability in designing well-folded and effective enzymes that bind to specific substrates with high affinities.
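
The "local influence from nearest amino acids in 3D space" mentioned in the abstract rests on picking, for each residue, its nearest neighbors by position. The following is a minimal sketch of that selection step only, assuming Euclidean distance between one coordinate per residue; the function name `k_nearest_neighbors`, the toy coordinates, and the value of `k` are illustrative assumptions, not details taken from the paper.

```python
import math


def k_nearest_neighbors(coords, k):
    """For each residue i, return the indices of its k nearest other residues.

    Assumption: one 3D point per residue and plain Euclidean distance.
    """
    neighbors = []
    for i, xi in enumerate(coords):
        # Distance from residue i to every other residue j.
        dists = [(math.dist(xi, xj), j) for j, xj in enumerate(coords) if j != i]
        dists.sort()  # closest first
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


# Toy coordinates (hypothetical, not real backbone positions).
toy_coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
nbrs = k_nearest_neighbors(toy_coords, k=2)
print(nbrs)
```

A layer that restricts message passing to these index lists only sees local geometry, which is why the architecture pairs it with attention over the full sequence for long-range context.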

Paper Structure (33 sections, 16 equations, 6 figures, 19 tables)

This paper contains 33 sections, 16 equations, 6 figures, 19 tables.

Figures (6)

  • Figure 1: (a) EnzyGen architecture, consisting of an enzyme modeling module (left) and a substrate representation module (right). The enzyme modeling module generates the enzyme sequence and backbone structure, and the substrate representation module predicts whether an enzyme can bind to a substrate. The dashed box in the enzyme input denotes functionally important sites, while the other sites need to be generated. [M] denotes the mask token. "1.1.1.1" denotes the fourth-level enzyme class in the BRENDA enzyme classification (EC) tree. (b) Neighborhood equivariant layer: neighborhood message update (in green), neighborhood coordinate update (in blue) and neighborhood node feature update (in red). Indexed selection chooses $\boldsymbol{x}_j$ (or $\boldsymbol{h}_j$) where the $j$-th residue is among the $K$-nearest neighbors of the $i$-th residue.
  • Figure 2: Discovering functionally important sites. Each row is a protein sequence in the same enzyme family (a BRENDA fourth-level enzyme classification tree category). We use ClustalW2 to perform multiple sequence alignment and select common residues above the identity threshold $\tau$. In the aligned sequences shown, E, G and M are common to all sequences, so they are selected as important sites. In our experiments, $\tau=30\%$.
  • Figure 3: (a) Ablation study comparing EnzyGen against ESM2+EGNN on ESP score. (b) Ablation study comparing EnzyGen against ESM2+EGNN on AlphaFold2 pLDDT. (c) Ablation study comparing EnzyGen against the model removing enzyme-substrate interaction constraint (EnzyGen-w/o-sub). (d) Ablation study on different model scales.
  • Figure 4: (a) ESP scores of designed enzymes from new fourth-level classes or with new substrates. The dashed line denotes the median. (b) Fourth-level category embedding clustering. Case study: (c) Complex of designed 1KAG (2.7.1.71, catalyzing the specific phosphorylation of the 3-hydroxyl group of shikimic acid) and substrate ATP(-4), with pLDDT=90.39 and UniProt blastp recovery rate = 58.5%. (d) Complex of designed 5L2P (3.1.1.2, hydrolyzing various p-nitrophenyl phosphates, aromatic esters and p-nitrophenyl fatty acids) and substrate paraoxon, with pLDDT=89.44 and UniProt blastp recovery rate = 49.4%. Both cases show polar contacts (hydrogen bonds) depicted in purple.
  • Figure 5: Enzyme Classification (EC) Tree in BRENDA.
  • ...and 1 more figure
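
The site-mining idea in the Figure 2 caption can be sketched in a few lines. This is a rough illustration rather than the authors' pipeline: it assumes the alignment has already been computed (the caption names ClustalW2 for that step), and it assumes a column counts as "common" when its most frequent non-gap residue appears in at least a fraction `tau` of the sequences. The function name `conserved_columns` and the toy alignment are hypothetical.

```python
from collections import Counter


def conserved_columns(aligned, tau=0.30, gap="-"):
    """Return (column, residue) pairs for alignment columns whose most common
    non-gap residue is shared by at least a fraction `tau` of the sequences.

    Assumption: this per-column identity rule is one plausible reading of the
    threshold described in the caption, not a confirmed definition.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    sites = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] != gap)
        if not counts:
            continue  # column contains only gaps
        residue, n = counts.most_common(1)[0]
        if n / len(aligned) >= tau:
            sites.append((col, residue))
    return sites


# Toy alignment (hypothetical) in which E, G and M are shared by every row.
toy_alignment = [
    "E-GAM",
    "EKG-M",
    "ERGTM",
]
sites = conserved_columns(toy_alignment, tau=1.0)
print(sites)
```

The toy call uses `tau=1.0` to mirror the caption's example of residues common to all sequences; with only three rows, a threshold of 30% would accept every non-gap column, so a low threshold is only meaningful on larger families.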

Theorems & Definitions (1)

  • proof