Protein Design with Dynamic Protein Vocabulary

Nuowei Liu, Jiahao Kuang, Yanting Liu, Tao Ji, Changzhi Sun, Man Lan, Yuanbin Wu

TL;DR

This work tackles the challenge of designing proteins that are both functionally aligned and structurally plausible. It introduces ProDVa, a dynamic vocabulary framework that combines a Text Language Model, a Protein Language Model, and a Fragment Encoder to condition generation on textual descriptions via retrieval of natural protein fragments. The approach achieves competitive function alignment with far less training data while significantly improving foldability, as evidenced by higher pLDDT and lower PAE scores compared to strong baselines. By leveraging InterPro-identified fragments and a retrieval-based description mechanism, ProDVa demonstrates robust design performance across keyword- and text-driven tasks and suggests a promising direction for data-efficient, structure-aware protein design with potential for wet-lab validation. Overall, the method highlights the value of incorporating biologically meaningful fragments into a programmable protein vocabulary to steer de novo design toward realistic folds and intended functions.

Abstract

Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet these models still struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.
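The two foldability figures quoted in the abstract are proportions of designed proteins that clear a threshold (pLDDT above 70, PAE below 10). A minimal sketch of how such proportions could be computed is given here, assuming per-protein scores are already available from a structure predictor. The function name `well_folded_fractions` and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple

# Thresholds as stated in the abstract.
PLDDT_THRESHOLD = 70.0  # "pLDDT above 70"
PAE_THRESHOLD = 10.0    # "PAE below 10"


def well_folded_fractions(
    plddt: Sequence[float], pae: Sequence[float]
) -> Tuple[float, float]:
    """Return (fraction with pLDDT > 70, fraction with PAE < 10).

    Hypothetical helper: each input holds one score per designed protein.
    """
    if len(plddt) != len(pae):
        raise ValueError("plddt and pae must have one entry per protein")
    if not plddt:
        raise ValueError("at least one protein is required")
    n = len(plddt)
    frac_plddt = sum(p > PLDDT_THRESHOLD for p in plddt) / n
    frac_pae = sum(e < PAE_THRESHOLD for e in pae) / n
    return frac_plddt, frac_pae


# Illustrative scores only (not results from the paper).
example_plddt = [80.0, 65.0, 72.0, 90.0]
example_pae = [5.0, 12.0, 9.0, 20.0]
fractions = well_folded_fractions(example_plddt, example_pae)
```

Comparing these fractions between two models gives the kind of difference the abstract reports; the sketch does not assume whether that difference is absolute or relative.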

Paper Structure

This paper contains 58 sections, 14 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: (a) Visualization of proteins designed by our method, embedded using ESM C [hayes2024simulating] and projected with UMAP [mcinnes2018umap]. Random refers to proteins generated by randomly selecting amino acids according to their empirical distribution in SwissProt [bairoch2000swiss]. Random+ refers to proteins generated by selecting amino acids and incorporating fragments. Natural proteins randomly sampled from SwissProt (gray) form a broad distribution, representing the diverse landscape of natural proteins. Proteins generated by Random (red points) cluster tightly at the periphery of the natural protein distribution, suggesting that random proteins are less diverse. Proteins generated by Random+ (yellow points) are distributed much more diversely than those generated by Random. (b) Performance on pLDDT ($\uparrow$). Our method improves pLDDT by 12% over Random+, exceeding the well-folded threshold. (c) Performance on PAE ($\downarrow$). Our method reduces PAE by 9% compared to Random+ and is the only model to surpass the well-folded threshold. Notably, it outperforms the state-of-the-art baseline model Pinal on both metrics.
  • Figure 2: An example of Q96TW8, illustrating how an amino acid sequence is divided into sets of tokens and fragments. Note that when the BPE tokenizer [sennrich2015neural] is used, a single token may represent multiple amino acids.
  • Figure 3: Overview of our model architecture.
  • Figure 4: Experimental results comparing ProDVa with a vanilla multimodal baseline on Mol-Instructions. (a) illustrates performance on sequence plausibility metrics. (b) and (c) show performance on foldability metrics. (d) presents performance on language alignment metrics, where deeper colors in the upper right corner indicate better performance.
  • Figure 5: Analysis of the selection of the Top $K$ most relevant descriptions during inference on Mol-Instructions. Results on the CAMEO subset are provided in Appendix \ref{sec:exp-more-topk}.
  • ...and 11 more figures
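
The Random and Random+ baselines in the Figure 1 caption can be illustrated with a short sketch. The caption only states that Random draws amino acids from their empirical distribution and that Random+ additionally incorporates fragments; how fragments are chosen and placed is not specified here, so the insertion rule below (a fixed probability `fragment_prob` of emitting a whole fragment at each step) is an assumption, as are all function names and the toy inputs.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def empirical_distribution(sequences: Sequence[str]) -> Dict[str, float]:
    """Estimate amino-acid frequencies from a set of reference sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: c / total for aa, c in counts.items()}


def sample_random(length: int, dist: Dict[str, float], rng: random.Random) -> str:
    """Random: draw each residue independently from the empirical distribution."""
    residues = list(dist)
    weights = [dist[aa] for aa in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))


def sample_random_plus(
    length: int,
    dist: Dict[str, float],
    fragments: Sequence[str],
    rng: random.Random,
    fragment_prob: float = 0.1,  # assumed value, not from the paper
) -> str:
    """Random+: mix single residues with whole fragments (assumed scheme)."""
    residues = list(dist)
    weights = [dist[aa] for aa in residues]
    pieces: List[str] = []
    current = 0
    while current < length:
        remaining = length - current
        # Only fragments that still fit keep the output at the requested length.
        fitting = [f for f in fragments if len(f) <= remaining]
        if fitting and rng.random() < fragment_prob:
            piece = rng.choice(fitting)
        else:
            piece = rng.choices(residues, weights=weights, k=1)[0]
        pieces.append(piece)
        current += len(piece)
    return "".join(pieces)


# Toy reference sequences and fragments, for illustration only.
reference = ["MKTAYIAKQR", "GSHMLEDPVA", "MKKLLPTAAA"]
toy_fragments = ["AYIAK", "LEDPV"]
rng = random.Random(0)
dist = empirical_distribution(reference)
random_seq = sample_random(50, dist, rng)
random_plus_seq = sample_random_plus(50, dist, toy_fragments, rng)
```

In practice the reference set would be SwissProt and the fragments would come from natural proteins, as the caption describes; the toy inputs only make the sketch self-contained.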