Progressive Multi-Modality Learning for Inverse Protein Folding

Jiangbin Zheng, Stan Z. Li

TL;DR

The paper tackles inverse protein folding under data scarcity by proposing MMDesign, a multi-modality transfer learning framework that combines a pretrained GVP-based structural module with a pretrained auto-encoder contextual module, linked by cross-modal alignment constraints. Through a two-stage training process, MMDesign leverages structural priors and language-like sequence knowledge to generate coherent protein sequences from backbone coordinates, achieving state-of-the-art perplexity and recovery on the CATH benchmark despite small training data. The work includes thorough ablation and distribution analyses, demonstrating the benefits of pretrained modules and offering interpretability about protein design patterns. Overall, MMDesign reduces data requirements while improving generalization across standard and out-of-domain tests, contributing a principled approach to cross-modal protein design with tangible practical impact in protein engineering.

Abstract

While deep generative models show promise for learning inverse protein folding directly from data, the lack of publicly available structure-sequence pairings limits their generalization. Previous improvements and data augmentation efforts to overcome this bottleneck have been insufficient. To further address this challenge, we propose a novel protein design paradigm called MMDesign, which leverages multi-modality transfer learning. To our knowledge, MMDesign is the first framework that combines a pretrained structural module with a pretrained contextual module, using an auto-encoder (AE) based language model to incorporate prior protein semantic knowledge. Experimental results show that MMDesign, trained only on a small dataset, consistently outperforms baselines on various public benchmarks. To further assess biological plausibility, we present systematic quantitative analysis techniques that provide interpretability and reveal more about the laws of protein design.

Paper Structure (10 sections, 6 equations, 4 figures, 2 tables)


Figures (4)

  • Figure 1: (a) The mainstream deep protein design framework, in which a GNN-based structure module encodes structural features, followed by an optional contextual module and a decoder that produces the sequence. (b) The proposed protein design paradigm, which consists of two pretrained modules: the pretrained structure module and the auto-encoder (AE) contextual module. An additional cross-layer, cross-modal alignment strengthens the constraints between the two modalities.
  • Figure 2: Diagram of the MMDesign framework. The proposed training pipeline is divided into two steps: Step 1 pretrains the structural module and the contextual module separately, while Step 2 transfers the pretrained modules from Step 1 and optimizes the overall MMDesign framework.
  • Figure 3: Normalized statistics of the number of amino acid residue types corresponding to the ground-truth sequences (base) and generated sequences of protein design models on CATH test set. The KL value indicates the KL divergence of the generated sequence distribution with respect to the ground-truth sequence distribution.
  • Figure 4: Recovery score comparison of different models on short, medium, and long sequences derived from the CATH test set. Length division: Short $\in (0, 125]$, Medium $\in (125, 188]$, Long $\in (188, 500)$; the three divisions contain equal numbers of sequences.
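
The quantities named in the captions for Figures 3 and 4 can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names, the pseudocount smoothing, and the KL direction (generated relative to ground truth, one reading of the Figure 3 caption) are all assumptions made for illustration. Recovery is taken here as the fraction of positions where the generated residue matches the native one, and the length thresholds are copied from the Figure 4 caption.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types


def residue_distribution(sequences, pseudocount=1e-9):
    """Normalized frequency of each residue type over a set of sequences.

    The pseudocount is an assumption added here so that a residue type
    absent from one set does not produce log(0) in the KL computation.
    """
    counts = Counter(aa for seq in sequences for aa in seq if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """KL(p || q) in nats over the 20 residue types.

    Assumed direction: p is the generated distribution, q the ground truth.
    """
    return sum(p[aa] * math.log(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def recovery(generated, native):
    """Fraction of positions where the generated residue equals the native one."""
    if len(generated) != len(native) or not native:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(g == n for g, n in zip(generated, native))
    return matches / len(native)


def length_bucket(length):
    """Length division from the Figure 4 caption: (0,125], (125,188], (188,500)."""
    if length <= 0 or length >= 500:
        raise ValueError("length outside the (0, 500) range used in Figure 4")
    if length <= 125:
        return "short"
    if length <= 188:
        return "medium"
    return "long"


if __name__ == "__main__":
    native = ["ACDEFGHIKL", "MNPQRSTVWY"]      # toy ground-truth sequences
    generated = ["ACDEFGHIKV", "MNPQRSTVWY"]   # toy designed sequences
    p = residue_distribution(generated)
    q = residue_distribution(native)
    print("KL(generated || native):", kl_divergence(p, q))
    for g, n in zip(generated, native):
        print(length_bucket(len(n)), recovery(g, n))
```

A per-bucket comparison like the one in Figure 4 would then average `recovery` over all test sequences that fall into the same `length_bucket`.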