Annotation-guided Protein Design with Multi-Level Domain Alignment

Chaohao Yuan; Songyou Li; Geyan Ye; Yikun Zhang; Long-Kai Huang; Wenbing Huang; Wei Liu; Jianhua Yao; Yu Rong

TL;DR

PAAG introduces annotation-guided protein design: it aligns protein sequences with domain-level and property-level textual annotations to enable controllable generation in sequence space. It uses a multi-level alignment strategy with a local Annotation-Domain Contrastive (ADC) loss and a global Annotation-Protein Contrastive (APC) loss, plus an Annotation-Protein Matching (APM) objective, all trained end-to-end with a conditional autoregressive decoder. The approach improves predictive tasks (an average relative gain of about 1.5%) and yields significant gains in domain- and property-conditioned design, achieving higher SR1 scores and enabling joint generation of proteins with multiple annotations. By leveraging rich textual annotations and a shared latent space, PAAG broadens the design space for functional proteins and offers a foundation for future sequence-structure co-design and larger, annotation-rich datasets.
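To make the contrastive alignment idea concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss between paired protein and annotation embeddings. It is not the paper's exact APC or ADC formulation, which this summary does not give; the function names, the cosine similarity, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    # Cosine similarity; embeddings are assumed to be non-zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def symmetric_contrastive_loss(protein_embs, annotation_embs, temperature=0.07):
    # Pair i of proteins matches pair i of annotations; every other
    # item in the batch serves as a negative. temperature is hypothetical.
    n = len(protein_embs)
    sim = [[cosine(p, a) / temperature for a in annotation_embs]
           for p in protein_embs]
    # Protein-to-annotation direction: rows of the similarity matrix.
    p2a = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Annotation-to-protein direction: columns of the similarity matrix.
    a2p = sum(cross_entropy_row([sim[j][i] for j in range(n)], i)
              for i in range(n)) / n
    return 0.5 * (p2a + a2p)

# Toy usage: matched pairs should score a lower loss than mismatched pairs.
proteins = [[1.0, 0.0], [0.0, 1.0]]
aligned = symmetric_contrastive_loss(proteins, [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss(proteins, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Under this reading, a local loss would apply the same computation to domain and domain-annotation embeddings, while a global loss would apply it to whole-protein and property-annotation embeddings.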

Abstract

The core challenge of de novo protein design lies in creating proteins with specific functions or properties, guided by certain conditions. Current models generate proteins under structural and evolutionary guidance, which provides only indirect conditioning on functions and properties. However, textual annotations of proteins, especially annotations of protein domains, which directly describe a protein's high-level functionalities, properties, and their correlation with target amino acid sequences, remain unexplored in protein design tasks. In this paper, we propose Protein-Annotation Alignment Generation (PAAG), a multi-modality protein design framework that integrates textual annotations extracted from protein databases for controllable generation in sequence space. Specifically, with a multi-level alignment module, PAAG can explicitly generate proteins containing specific domains conditioned on the corresponding domain annotations, and can even design novel proteins with flexible combinations of different kinds of annotations. Our experimental results underscore the superiority of the aligned protein representations from PAAG across 7 prediction tasks. Furthermore, PAAG demonstrates a significant increase in generation success rate (24.7% vs. 4.7% for the zinc finger domain, and 54.3% vs. 22.0% for the immunoglobulin domain) compared with the existing model. We anticipate that PAAG will broaden the horizons of protein design by leveraging the knowledge shared between textual annotations and proteins.
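The generation success rate quoted above can be read as the fraction of generated sequences in which the target domain is detected. The sketch below assumes that detection means a domain-search e-value below a chosen threshold; the threshold, the use of `None` for "no hit", and the function name are assumptions for illustration, not the paper's stated protocol.

```python
def success_rate(e_values, threshold):
    """Fraction of generated sequences whose best domain hit beats the threshold.

    e_values: one entry per generated sequence; None means no hit was reported.
    threshold: hypothetical e-value cutoff; a smaller value is stricter.
    """
    if not e_values:
        raise ValueError("need at least one generated sequence")
    hits = sum(1 for e in e_values if e is not None and e < threshold)
    return hits / len(e_values)

# Toy usage with made-up e-values for four generated sequences.
rate = success_rate([1e-5, 0.5, None, 1e-3], threshold=1e-2)
print(f"{rate:.1%}")
```

Sequences with no reported hit count as failures in this sketch, so the denominator is always the total number of generated sequences.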

Paper Structure (34 sections, 10 equations, 7 figures, 11 tables)

Introduction
Preliminaries
Protein and Its Textual Annotations
Encoders for Proteins and Annotations
Methodology
Multi-level Protein and Annotation Alignment
Local Alignment
Global Alignment
Conditional Protein Decoding
Training Objectives
Annotation-guided Protein Design
Experiment
Construction of ProtAnnotation Dataset
Quality of Aligned Representation
Unconditional Protein Generation
...and 19 more sections

Figures (7)

Figure 1: (a) An example of property annotations (in bold) and domain annotations (in colors). (b) An illustration of annotation-guided protein design with PAAG. Given a textual description containing an immunoglobulin domain annotation, PAAG can generate proteins that contain the immunoglobulin domain.
Figure 2: The overall framework of PAAG. Components with shared parameters are drawn in the same color. PAAG contains three modules. (1) The Protein & Annotation Encoding module encodes the input protein sequence, its domains, and the corresponding annotations into embeddings. (2) The Multi-level Alignment module projects the protein and annotation embeddings into a shared latent space and employs the Annotation-Protein Contrastive (APC) loss, the Annotation-Domain Contrastive (ADC) loss, and the Annotation-Protein Matching (APM) loss to align them. (3) The Conditional Protein Decoding module accepts the annotation embedding as input and generates the protein sequence.
Figure 3: Panels (a) and (b) show $\text{SR}_{e}$ on the zinc-finger domain and the immunoglobulin domain for all models. Panels (c) and (d) show the corresponding e-value distributions. The white bar indicates the mean e-value of each set. PAAG consistently performs better on all metrics than the other models, and fine-tuning brings additional improvement for PAAG.
Figure 4: Visualization of the generated results on the zinc-finger and immunoglobulin domains. The corresponding prompt and generation quality (e-value) are listed below each result.
Figure 5: The relation between the number specified in the prompt and the domains generated by PAAG.
...and 2 more figures

Theorems & Definitions (1)

Definition 3.1: Annotation-guided Protein Design
