Multi-Peptide: Multimodality Leveraged Language-Graph Learning of Peptide Properties

Srivathsan Badrinarayanan; Chakradhar Guntuboina; Parisa Mollaei; Amir Barati Farimani

TL;DR

Multi-Peptide predicts peptide properties by fusing sequence-based representations from PeptideBERT with structure-aware embeddings from a Graph Neural Network (GNN) trained on graphs built from AlphaFold-derived PDB structures. A CLIP-style loss aligns the two modalities in a shared latent space, so the model learns jointly from amino-acid sequence context and three-dimensional structural information. The approach achieves state-of-the-art accuracy on hemolysis prediction (86.185%). Gains are task-dependent: on the nonfouling dataset, a fine-tuned text model still outperforms the ensemble. Overall, the work highlights the promise of multimodal learning in bioinformatics for more accurate and holistic peptide property prediction, with open resources provided for reproducibility and future method refinements.

Abstract

Peptides are essential in biological processes and therapeutics. In this study, we introduce Multi-Peptide, an innovative approach that combines transformer-based language models with Graph Neural Networks (GNNs) to predict peptide properties. We combine PeptideBERT, a transformer model tailored for peptide property prediction, with a GNN encoder to capture both sequence-based and structural features. By employing Contrastive Language-Image Pre-training (CLIP), Multi-Peptide aligns embeddings from both modalities into a shared latent space, thereby enhancing the model's predictive accuracy. Evaluations on hemolysis and nonfouling datasets demonstrate Multi-Peptide's robustness, achieving state-of-the-art 86.185% accuracy in hemolysis prediction. This study highlights the potential of multimodal learning in bioinformatics, paving the way for accurate and reliable predictions in peptide-based research and applications.
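The CLIP-style alignment described above can be illustrated with a small, framework-free sketch of a symmetric contrastive loss. This is not the authors' implementation: the function names, the temperature value of 0.07, and the toy two-dimensional embeddings are all assumptions made for illustration, and the sketch assumes both encoders have already been projected to the same embedding dimension.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_loss(seq_emb, graph_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    seq_emb[i] and graph_emb[i] are assumed to describe the same peptide.
    The temperature is an illustrative value, not taken from the paper.
    """
    if len(seq_emb) != len(graph_emb):
        raise ValueError("both modalities need the same batch size")
    s = [l2_normalize(v) for v in seq_emb]
    g = [l2_normalize(v) for v in graph_emb]
    n = len(s)

    # Cosine similarity between every sequence/graph pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(s[i], g[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Sequence -> graph: row i should assign most probability to column i.
    loss_s2g = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Graph -> sequence: the same objective on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    loss_g2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_s2g + loss_g2s)


# Toy batch of two peptides: matched pairs point in the same direction.
seq = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_loss(seq, [[2.0, 0.0], [0.0, 3.0]])
shuffled = clip_loss(seq, [[0.0, 3.0], [2.0, 0.0]])
```

In this toy batch the loss is close to zero when each sequence embedding points the same way as its paired graph embedding, and large when the pairs are swapped. Minimizing it therefore pulls matched sequence and structure embeddings together while pushing mismatched ones apart.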

Paper Structure (11 sections, 3 equations, 3 figures, 4 tables)

Introduction
Methods
Datasets
Model Architecture
Results and Discussion
Visualization of Representations
Conclusion
Data and Software Availability
Supporting Information
Model training
Reproducibility and other comments

Figures (3)

Figure 1: Representation of the Multi-Peptide framework. The PeptideBERT and GNN encoders are first pre-trained; the CLIP stage then further trains the PeptideBERT model while the GNN weights are kept frozen. Inference is performed at the end on each test dataset using the updated PeptideBERT weights.
Figure 2: Distributions of peptide sequence length and atom count for the hemolysis and nonfouling datasets. (a) Sequence lengths range from 1 to 190 amino acids in the hemolysis dataset. The distribution is non-uniform, with a prominent peak and a broad spread. (b) The number of atoms ranges from 7 to 1456 in the hemolysis dataset, and its distribution mirrors the corresponding sequence length distribution. (c) Sequence lengths range from 5 to 198 amino acids in the nonfouling dataset. The distribution peaks at low lengths and is relatively uniform, with sparse occurrences, at higher lengths. (d) The number of atoms ranges from 25 to 1688 in the nonfouling dataset, and its distribution mirrors the corresponding sequence length distribution.
Figure 3: t-SNE plots of the embeddings for the nonfouling test dataset. Blue points represent negatively labeled sequences, and red points represent positives. (a) 2D t-SNE plot of the embedding space generated by the PeptideBERT encoder. (b) 2D t-SNE plot of the embedding space generated by the GNN encoder. (c) 2D t-SNE plot of the shared latent space after CLIP. (d) 3D t-SNE plot of the shared latent space after CLIP.
