Open-Source Protein Language Models for Function Prediction and Protein Design

Shivasankaran Vanaja Pandi, Bharath Ramsundar

TL;DR

This paper lowers the barrier to entry for large PLMs by integrating ProtBERT into the open-source DeepChem framework. The authors report reasonable performance on several protein-prediction tasks using embeddings from a pretrained model trained on ~1M sequences, and they explore latent-space manipulation to generate plastic-degrading enzyme candidates. They check the generated structures with AlphaFold, giving a proof of concept for environment-focused protein design under resource constraints. The work offers a reusable, accessible platform for protein-function prediction and design in synthetic biology and sustainability research.

Abstract

Protein language models (PLMs) have shown promise in improving the understanding of protein sequences, contributing to advances in areas such as function prediction and protein engineering. However, training these models from scratch requires significant computational resources, limiting their accessibility. To address this, we integrate a PLM into DeepChem, an open-source framework for computational biology and chemistry, to provide a more accessible platform for protein-related tasks. We evaluate the performance of the integrated model on various protein prediction tasks, showing that it achieves reasonable results across benchmarks. Additionally, we present an exploration of generating plastic-degrading enzyme candidates using the model's embeddings and latent space manipulation techniques. While the results suggest that further refinement is needed, this approach provides a foundation for future work in enzyme design. This study aims to facilitate the use of PLMs in research fields like synthetic biology and environmental sustainability, even for those with limited computational resources.
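
The abstract does not spell out the latent-space manipulation it refers to. As a rough illustration only, one simple form of the idea is to move a seed protein's embedding part of the way toward the average embedding of proteins that have the desired property. The sketch below assumes embeddings are plain lists of floats; the names `mean_vector`, `shift_toward`, and `alpha` are hypothetical, and this is not presented as the authors' procedure.

```python
from typing import List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized embedding vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def shift_toward(seed: Sequence[float],
                 target: Sequence[float],
                 alpha: float) -> List[float]:
    """Linear interpolation: alpha=0 returns the seed, alpha=1 the target."""
    if len(seed) != len(target):
        raise ValueError("seed and target must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return [(1.0 - alpha) * s + alpha * t for s, t in zip(seed, target)]


# Toy 2-dimensional embeddings (made-up numbers, for illustration only).
property_embeddings = [[1.0, 3.0], [3.0, 1.0]]   # proteins with the target property
seed_embedding = [0.0, 0.0]                      # embedding of the seed protein

centroid = mean_vector(property_embeddings)
shifted = shift_toward(seed_embedding, centroid, alpha=0.5)
print(shifted)
```

In a real pipeline the shifted vector would still have to be decoded back into an amino-acid sequence, a step this sketch leaves out.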

Paper Structure

This paper contains 16 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: DeepChem Pipeline Illustration
  • Figure 2: Protein generation pipeline: (a) Overview of the training pipeline for the protein generation model (b) Protein generation workflow demonstrating the use of a seed protein sequence to generate novel proteins with targeted properties.
  • Figure 3: Qualitative Results of Generated Proteins. Two proteins generated by our method. The protein structures are color-coded according to the pLDDT score (Jumper et al., 2021): blue: very high (pLDDT $>$ 90), teal: confident (90 $>$ pLDDT $>$ 70), yellow: low (70 $>$ pLDDT $>$ 50), orange: very low (pLDDT $<$ 50).
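
The color bands in the Figure 3 caption can be written as a small lookup, which may help when reading per-residue confidence scores from a predicted structure. This is a minimal sketch: the function name `plddt_band` is hypothetical, and because the caption uses strict inequalities, assigning the exact boundary values (90, 70, 50) to the lower band is an assumption.

```python
from typing import Tuple


def plddt_band(score: float) -> Tuple[str, str]:
    """Map a per-residue confidence score (0-100) to (label, color).

    Bands follow the Figure 3 caption. Boundary values of exactly
    90, 70, or 50 fall into the lower band (an assumption; the caption
    does not say where they belong).
    """
    if not 0.0 <= score <= 100.0:
        raise ValueError("score must be between 0 and 100")
    if score > 90:
        return ("very high", "blue")
    if score > 70:
        return ("confident", "teal")
    if score > 50:
        return ("low", "yellow")
    return ("very low", "orange")


# Made-up scores, for illustration only.
for s in (95.2, 81.0, 63.5, 42.0):
    label, color = plddt_band(s)
    print(f"{s:5.1f} -> {label} ({color})")
```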