MSAGPT: Neural Prompting Protein Structure Prediction via MSA Generative Pre-Training

Bo Chen, Zhilei Bei, Xingyi Cheng, Pan Li, Jie Tang, Le Song

TL;DR

MSAGPT prompts protein structure prediction via MSA generative pre-training in the low-MSA regime, using a simple yet effective 2D evolutionary positional encoding scheme to model complex evolutionary patterns.

Abstract

Multiple Sequence Alignment (MSA) plays a pivotal role in unveiling the evolutionary trajectories of protein families. Protein structure prediction is often less accurate for sequences that lack enough homologous information to construct a high-quality MSA. Although various methods have been proposed to generate virtual MSAs under these conditions, they either fail to capture the intricate coevolutionary patterns within an MSA or require guidance from external oracle models. Here we introduce MSAGPT, a novel approach that prompts protein structure prediction via MSA generative pre-training in the low-MSA regime. MSAGPT employs a simple yet effective 2D evolutionary positional encoding scheme to model complex evolutionary patterns. Building on this encoding, its flexible 1D MSA decoding framework supports zero- or few-shot learning. Moreover, we demonstrate that feedback from AlphaFold2 (AF2) can further enhance the model's capacity via Rejective Fine-tuning (RFT) and Reinforcement Learning from AF2 Feedback (RLAF). Extensive experiments confirm that MSAGPT generates faithful virtual MSAs that improve structure prediction accuracy. Its transfer learning capabilities also highlight its potential for facilitating other protein tasks.

Paper Structure

This paper contains 33 sections, 5 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: (a) Illustration of an MSA and (b) performance comparison between MSAGPT and advanced baselines on three natural MSA-scarce benchmarks.
  • Figure 2: The overall framework of prompting protein structure predictions via MSA generation. Left: The challenge faced by conventional search algorithms on proteins with scarce homologous sequences, resulting in suboptimal alignments. Middle-to-Right: MSAGPT generates informative and high-quality MSA for such challenging queries, presenting a promising approach to overcoming these limitations. $\texttt{[M]}$ denotes the sequence separator. $\texttt{[S]}$ and $\texttt{[E]}$ are the special tokens that mark the start and end of MSA generation.
  • Figure 3: Comparison between axial attention (exemplified by rao2021msa) and the attention used in MSAGPT, within a single layer. Here we focus on the information aggregated to the amino acid (AA) "G". The 2D evolutionary position enhanced attention is more efficient than the decoupled axial attentions: it gathers sufficient information in a single aggregation step.
  • Figure 4: The effect of different MSA depths and selection methods. The X-axis indicates the different MSA depths. The Y-axis represents the TM-Score. The dashed line denotes the non-selection baseline.
  • Figure 5: Ablation study with positional embedding variants.
  • ...and 10 more figures
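
The captions for Figures 2 and 3 describe decoding an MSA as a 1D token stream while keeping track of each residue's place in the 2D alignment. The following is a minimal sketch of that idea, not the authors' implementation. The function name `flatten_msa`, the token layout (query residues, then `[S]`, then homologs separated by `[M]`, then `[E]`), and the convention of giving special tokens column 0 are all assumptions made for illustration.

```python
from typing import List, Tuple

# Special tokens named in the Figure 2 caption; their exact placement is assumed here.
START, SEP, END = "[S]", "[M]", "[E]"


def flatten_msa(query: str, homologs: List[str]) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Flatten an aligned MSA into a 1D token list with (row, column) position ids.

    Row 0 is the query; residues occupy columns 1..L. Each special token takes
    column 0 of the row it opens (hypothetical convention, not from the paper).
    """
    rows = [query] + list(homologs)
    if any(len(r) != len(query) for r in rows):
        raise ValueError("all aligned sequences must have the same length")

    tokens: List[str] = []
    positions: List[Tuple[int, int]] = []
    for row_idx, seq in enumerate(rows):
        if row_idx > 0:
            # [S] opens the first generated row, [M] separates later rows.
            tokens.append(START if row_idx == 1 else SEP)
            positions.append((row_idx, 0))
        for col_idx, residue in enumerate(seq, start=1):
            tokens.append(residue)
            positions.append((row_idx, col_idx))
    tokens.append(END)
    positions.append((len(rows), 0))
    return tokens, positions


if __name__ == "__main__":
    toks, pos = flatten_msa("MK", ["MR", "-K"])
    for t, p in zip(toks, pos):
        print(t, p)
```

In a model, the row and column ids would typically each index a positional embedding, so that a single attention step over the flattened stream can see both which sequence and which alignment column a residue belongs to. How MSAGPT actually combines the two indices is not specified in this summary.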

Theorems & Definitions (1)

  • Definition 1