Table of Contents
Fetching ...

Peptide-GPT: Generative Design of Peptides using Generative Pre-trained Transformers and Bio-informatic Supervision

Aayush Shah, Chakradhar Guntuboina, Amir Barati Farimani

TL;DR

The experimental results demonstrate the effectiveness of PeptideGPT in de novo protein design and underscore the potential of leveraging NLP-based approaches for paving the way for future innovations and breakthroughs in synthetic biology and bioinformatics.

Abstract

In recent years, natural language processing (NLP) models have demonstrated remarkable capabilities in various domains beyond traditional text generation. In this work, we introduce PeptideGPT, a protein language model tailored to generate protein sequences with distinct properties: hemolytic activity, solubility, and non-fouling characteristics. To facilitate a rigorous evaluation of these generated sequences, we established a comprehensive evaluation pipeline consisting of ideas from bioinformatics to retain valid proteins with ordered structures. First, we rank the generated sequences based on their perplexity scores, then we filter out those lying outside the permissible convex hull of proteins. Finally, we predict the structure using ESMFold and select the proteins with pLDDT values greater than 70 to ensure ordered structure. The properties of generated sequences are evaluated using task-specific classifiers - PeptideBERT and HAPPENN. We achieved an accuracy of 76.26% in hemolytic, 72.46% in non-hemolytic, 78.84% in non-fouling, and 68.06% in solubility protein generation. Our experimental results demonstrate the effectiveness of PeptideGPT in de novo protein design and underscore the potential of leveraging NLP-based approaches for paving the way for future innovations and breakthroughs in synthetic biology and bioinformatics. Codes, models, and data used in this study are freely available at: https://github.com/aayush-shah14/PeptideGPT.

Peptide-GPT: Generative Design of Peptides using Generative Pre-trained Transformers and Bio-informatic Supervision

TL;DR

The experimental results demonstrate the effectiveness of PeptideGPT in de novo protein design and underscore the potential of leveraging NLP-based approaches for paving the way for future innovations and breakthroughs in synthetic biology and bioinformatics.

Abstract

In recent years, natural language processing (NLP) models have demonstrated remarkable capabilities in various domains beyond traditional text generation. In this work, we introduce PeptideGPT, a protein language model tailored to generate protein sequences with distinct properties: hemolytic activity, solubility, and non-fouling characteristics. To facilitate a rigorous evaluation of these generated sequences, we established a comprehensive evaluation pipeline consisting of ideas from bioinformatics to retain valid proteins with ordered structures. First, we rank the generated sequences based on their perplexity scores, then we filter out those lying outside the permissible convex hull of proteins. Finally, we predict the structure using ESMFold and select the proteins with pLDDT values greater than 70 to ensure ordered structure. The properties of generated sequences are evaluated using task-specific classifiers - PeptideBERT and HAPPENN. We achieved an accuracy of 76.26% in hemolytic, 72.46% in non-hemolytic, 78.84% in non-fouling, and 68.06% in solubility protein generation. Our experimental results demonstrate the effectiveness of PeptideGPT in de novo protein design and underscore the potential of leveraging NLP-based approaches for paving the way for future innovations and breakthroughs in synthetic biology and bioinformatics. Codes, models, and data used in this study are freely available at: https://github.com/aayush-shah14/PeptideGPT.

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: The pipeline for PeptideGPT. The peptide sequences are tokenized and passed through ProtGPT2 for fine-tuning on the four tasks. The generated peptides undergo bio-informatic supervision to ensure that only valid proteins with an ordered structure are chosen as the final output. The property is then evaluated through a suitable classifier.
  • Figure 2: Sequence length distribution of each Peptide property dataset (a) Hemolytic, (b) Non-Hemolytic, (c) Non-Fouling, and (d) Solubility.
  • Figure 3: Convex hulls for Alanine. Points inside these hulls are considered to be valid protein sequences.
  • Figure 4: Accuracy of PeptideGPT vs ProtGPT2 generated sequences on each task. We can see an increase in the accuracy due to fine-tuning.
  • Figure 5: Example of generated sequences each predicted to have the desired property: a) Hemolytic b) Soluble c) Non-Fouling d) Non-Hemolytic
  • ...and 1 more figures