Functional Geometry Guided Protein Sequence and Backbone Structure Co-Design
Zhenqiao Song, Yunlong Zhao, Wenxian Shi, Yang Yang, Lei Li
TL;DR
NAEPro introduces a motif-guided, SE(3)-equivariant co-design framework that jointly designs protein sequences and backbone structures by leveraging an interleaved stack of neighborhood attentive equivariant layers (NAELs). Each NAEL combines a global Transformer-style attention mechanism with a neighborhood-based, SE(3)-equivariant update to propagate information across the whole sequence and through local 3D neighborhoods, enabling efficient one-shot design of all residues. The method uses meaningful protein fragments mined from multiple sequence alignments (MSAs) to guide functionally relevant design and employs a joint likelihood objective with coordinate regression to produce coherent sequences and structures. Across β-lactamase and myoglobin metalloprotein datasets, NAEPro achieves state-of-the-art docking performance, generates diverse and novel sequences with plausible active-site environments, and runs significantly faster than baselines, highlighting its potential for rapid, function-aware protein design. Limitations include the lack of wet-lab validation, which the authors plan to address in future work.
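To make the idea of a neighborhood-based equivariant update more concrete, the following is a minimal sketch in plain Python. It is not the authors' NAEL: it assumes an EGNN-style coordinate update over the k nearest residues in 3D, with a hand-written scalar weight standing in for a learned network. The names `knn_indices`, `equivariant_update`, and `rigid_transform`, the scalar per-residue features, and the parameters `k` and `step` are all hypothetical choices made for illustration.

```python
import math


def knn_indices(coords, k):
    """For each residue, return the indices of its k nearest residues in 3D."""
    neighbors = []
    for i, xi in enumerate(coords):
        dists = sorted((math.dist(xi, xj), j) for j, xj in enumerate(coords) if j != i)
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def equivariant_update(coords, feats, k=2, step=0.1):
    """One EGNN-style coordinate update over the k nearest 3D neighbors.

    The weights depend only on invariant quantities (scalar features and
    squared distances), and the shift is a weighted sum of difference
    vectors, so rotating or translating the input rotates or translates
    the output in the same way.
    """
    new_coords = []
    for i, nbrs in enumerate(knn_indices(coords, k)):
        shift = [0.0, 0.0, 0.0]
        for j in nbrs:
            d2 = sum((a - b) ** 2 for a, b in zip(coords[i], coords[j]))
            # Hypothetical scalar weight; a trained model would use a learned function here.
            w = math.tanh(feats[i] * feats[j]) / (1.0 + d2)
            for axis in range(3):
                shift[axis] += w * (coords[i][axis] - coords[j][axis])
        scale = step / max(len(nbrs), 1)
        new_coords.append([c + scale * s for c, s in zip(coords[i], shift)])
    return new_coords


def rigid_transform(coords, angle, shift):
    """Rotate by `angle` radians about the z-axis, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        [c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2]]
        for x, y, z in coords
    ]
```

Under these assumptions, updating the coordinates and then applying a rigid motion gives the same result as applying the rigid motion first and then updating, which is the equivariance property the layer design relies on. Restricting the inner loop to k neighbors keeps the cost of the geometric step roughly linear in the number of residues once the neighbor lists are known.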
Abstract
Proteins are macromolecules responsible for essential functions in almost all living organisms. Designing reasonable proteins with desired functions is crucial. A protein's sequence and structure are strongly correlated, and together they determine its function. In this paper, we propose NAEPro, a model that jointly designs protein sequence and structure based on automatically detected functional sites. NAEPro is powered by an interleaving network of attention and equivariant layers, which captures global correlations across the whole sequence and local influence from the nearest amino acids in three-dimensional (3D) space. Such an architecture facilitates effective yet economical message passing at two levels. We evaluate our model and several strong baselines on two protein datasets, β-lactamase and myoglobin. Experimental results show that our model consistently achieves the highest amino acid recovery rate and TM-score and the lowest RMSD among all competitors. These findings indicate that our model can design protein sequences and structures that closely resemble their natural counterparts. In-depth analysis further supports our model's ability to generate effective proteins capable of binding to their target metallocofactors. We provide code, data, and models on GitHub.
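For readers unfamiliar with two of the reported metrics, the sketch below shows how amino acid recovery rate and RMSD are commonly computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `recovery_rate` and `rmsd` are hypothetical, sequences are assumed to be the same length and already aligned position by position, and the two coordinate sets are assumed to have been superposed beforehand (for example with a Kabsch-style alignment, which is not shown).

```python
import math


def recovery_rate(designed, native):
    """Fraction of positions where the designed residue matches the native one."""
    if len(designed) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for d, n in zip(designed, native) if d == n)
    return matches / len(native)


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superposed coordinate sets."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))
```

A higher recovery rate means the designed sequence agrees with the native one at more positions, while a lower RMSD means the designed backbone atoms lie closer to their native positions.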
