Functional Geometry Guided Protein Sequence and Backbone Structure Co-Design

Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Yang Yang, Lei Li

TL;DR

NAEPro introduces a motif-guided, SE(3)-equivariant co-design framework that jointly designs protein sequences and backbone structures by leveraging an interleaved stack of neighborhood attentive equivariant layers (NAELs). Each NAEL combines a global Transformer-style attention mechanism with a neighborhood-based, SE(3)-equivariant update to propagate information across the whole sequence and through local 3D neighborhoods, enabling efficient one-shot design of all residues. The method uses meaningful protein fragments mined from MSAs to guide functionally relevant design and employs a joint likelihood objective with coordinate regression to produce coherent sequences and structures. Across β-lactamase and myoglobin metalloprotein datasets, NAEPro achieves state-of-the-art docking performance, generates diverse and novel sequences with plausible active-site environments, and runs significantly faster than baselines, highlighting its potential for rapid, function-aware protein design. Limitations include the lack of wet-lab validation, which the authors plan to address in future work.

Abstract

Proteins are macromolecules responsible for essential functions in almost all living organisms. Designing reasonable proteins with desired functions is crucial. A protein's sequence and structure are strongly correlated, and together they determine its function. In this paper, we propose NAEPro, a model to jointly design protein sequence and structure based on automatically detected functional sites. NAEPro is powered by an interleaving network of attention and equivariant layers, which can capture global correlation across a whole sequence and local influence from the nearest amino acids in three-dimensional (3D) space. Such an architecture facilitates effective yet economical message passing at two levels. We evaluate our model and several strong baselines on two protein datasets, $\beta$-lactamase and myoglobin. Experimental results show that our model consistently achieves the highest amino acid recovery rate and TM-score and the lowest RMSD among all competitors. These findings demonstrate the capability of our model to design protein sequences and structures that closely resemble their natural counterparts. In-depth analysis further confirms our model's ability to generate highly effective proteins capable of binding to their target metallocofactors. We provide code, data and models on GitHub.
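
Two of the metrics named above can be illustrated with a minimal sketch. The function names `amino_acid_recovery` and `rmsd` are hypothetical and not taken from the NAEPro code. The sketch assumes the designed and native sequences are already aligned position by position, and that the two coordinate sets have already been superimposed; in practice RMSD is usually reported after an optimal rigid-body alignment, which is omitted here.

```python
import math


def amino_acid_recovery(designed: str, native: str) -> float:
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between two point sets.

    Assumption: the point sets are already superimposed, so no
    alignment step is performed here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal size")
    squared = sum(
        (a - b) ** 2
        for p, q in zip(coords_a, coords_b)
        for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(coords_a))
```

For instance, a four-residue design that matches the native sequence at three positions has a recovery rate of 0.75, and two backbones offset by exactly 1 Å at every atom have an RMSD of 1.0 Å.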

Paper Structure (29 sections, 2 theorems, 14 equations, 7 figures, 5 tables)

Key Result

Theorem 3.1

Let $R$ denote a rotation matrix from the $SO(3)$ group and $\boldsymbol{t}\in \mathbb{R}^{3}$ a translation vector. Our NAEL is $SE(3)$-equivariant: $\boldsymbol{H}^{l+1}, R\boldsymbol{x}^{l+1}+\boldsymbol{t} = \mathrm{NAEL}(\boldsymbol{H}^{l}, R\boldsymbol{x}^{l}+\boldsymbol{t})$.
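
The property stated in the theorem can be checked numerically on a toy layer. The sketch below is not the NAEL itself: `toy_layer` is a hypothetical, simplified EGNN-style update with one scalar feature per residue and all-pairs messages. Its weights depend only on the features and on squared distances, which is what makes the features invariant and the coordinates equivariant. The helpers `rotation_zx` and `rigid_transform` are likewise illustrative names.

```python
import math


def toy_layer(h, x):
    """Toy equivariant update (an assumption, not the paper's NAEL).

    h: list of scalar features, x: list of 3D points.
    Weights use only h and squared distances, so they are unchanged
    by any rotation or translation of x.
    """
    n = len(x)
    new_h, new_x = [], []
    for i in range(n):
        shift = [0.0, 0.0, 0.0]
        msg = 0.0
        for j in range(n):
            if i == j:
                continue
            diff = [x[i][k] - x[j][k] for k in range(3)]
            d2 = sum(c * c for c in diff)
            w = math.tanh(h[i] * h[j]) / (1.0 + d2)  # invariant weight
            msg += w / (n - 1)
            for k in range(3):
                shift[k] += w * diff[k] / (n - 1)
        new_h.append(h[i] + msg)
        new_x.append([x[i][k] + shift[k] for k in range(3)])
    return new_h, new_x


def rotation_zx(a, b):
    """Rotation about z by angle a, then about x by angle b."""
    rz = [[math.cos(a), -math.sin(a), 0.0],
          [math.sin(a), math.cos(a), 0.0],
          [0.0, 0.0, 1.0]]
    rx = [[1.0, 0.0, 0.0],
          [0.0, math.cos(b), -math.sin(b)],
          [0.0, math.sin(b), math.cos(b)]]
    return [[sum(rx[r][k] * rz[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def rigid_transform(points, R, t):
    """Apply p -> R p + t to every point."""
    return [[sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3)]
            for p in points]
```

Applying `toy_layer` to transformed inputs and comparing against the transformed output of `toy_layer` on the original inputs should agree up to floating-point error, with the updated features identical in both cases. A check like this does not prove equivariance, but it is a cheap way to catch an implementation that breaks it.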

Figures (7)

  • Figure 1: NAEPro architecture, which consists of $L$ stacked neighborhood attentive equivariant layers (NAELs). Each NAEL is composed of one global attention sub-layer and one neighborhood equivariant sub-layer.
  • Figure 2: Neighborhood equivariant sub-layer: neighborhood message update (green), coordinate update (blue) and residue update (red). Indexed selection chooses $\boldsymbol{x}_j$ (or $\boldsymbol{h}_j$) for every $j^{th}$ residue that is among the k-nearest neighbors (kNN) of the $i^{th}$ residue.
  • Figure 3: (a) Inference speed of all models, measured by average design time on the test set. (b) and (c) Model performance on myoglobin under different MSA identity thresholds.
  • Figure 4: Designed $\beta$-lactamases belonging to different subclasses: (a) B1, (b) B2 and (c) B3 metallo-$\beta$-lactamases.
  • Figure 5: Designed myoglobins binding the heme ligand at (a) 92-Histidine and (b) 89-Histidine. Their amino acid identity to the most similar protein in UniProt ((c) PDB id=1SPG) is low, at $66.0\%$ and $26.7\%$ respectively, while their RMSD is also low, at (a) $0.458$ Å and (b) $3.943$ Å.
  • ...and 2 more figures
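
The indexed selection described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `knn_indices` and `indexed_select` are hypothetical, distances are plain Euclidean distances between residue coordinates, and a residue is assumed not to count as its own neighbor.

```python
import math


def knn_indices(x, k):
    """For each residue i, indices of its k nearest other residues.

    Assumption: residue i is excluded from its own neighborhood.
    Ties are broken by lower index because the sort is stable.
    """
    result = []
    for i, xi in enumerate(x):
        others = [j for j in range(len(x)) if j != i]
        others.sort(key=lambda j: math.dist(xi, x[j]))
        result.append(others[:k])
    return result


def indexed_select(values, neighbor_idx):
    """Gather values[j] (coordinates x_j or features h_j) per neighborhood."""
    return [[values[j] for j in nbrs] for nbrs in neighbor_idx]
```

With four residues placed on a line at positions 0, 1, 3 and 10 and `k = 2`, the residue at position 3 selects the residues at positions 1 and 0, in that order. The same index lists can be reused to gather both coordinates and features, so the neighbor search runs once per layer input.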

Theorems & Definitions (3)

  • Theorem 3.1
  • Corollary 3.2
  • proof