PatchProt: Hydrophobic patch prediction using protein foundation models

Dea Gogishvili, Emmanuel Minois-Genin, Jan van Eck, Sanne Abeln

TL;DR

This research sets a new standard for sequence-based protein property prediction and highlights the potential of fine-tuning foundation models and of enriching the model representation by training on related tasks.

Abstract

Hydrophobic patches on protein surfaces play important functional roles in protein-protein and protein-ligand interactions. Large hydrophobic surfaces are also involved in the progression of aggregation diseases. Predicting exposed hydrophobic patches from a protein sequence has been shown to be a difficult task. Fine-tuning a foundation model adapts it to the specific nuances of a new task using a much smaller dataset. Additionally, multi-task deep learning offers a promising way to address data gaps while also outperforming single-task methods. In this study, we used ESM-2, a recently released, leading large language model. We fine-tuned ESM-2 efficiently with a recently developed parameter-efficient fine-tuning method. This approach enabled comprehensive training of model layers without excessive parameters and without the need to include a computationally expensive multiple sequence analysis. We explored several related tasks, at local (residue) and global (protein) levels, to improve the representation of the model. As a result, our fine-tuned ESM-2 model, PatchProt, can not only predict hydrophobic patch areas but also outperforms existing methods on primary tasks, including secondary structure and surface accessibility prediction. Importantly, our analysis shows that including related local tasks can improve predictions on more difficult global tasks. This research sets a new standard for sequence-based protein property prediction and highlights the potential of fine-tuning foundation models and of enriching the model representation by training on related tasks.

Paper Structure

This paper contains 25 sections, 11 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Model architecture. The model takes a protein sequence as input and predicts both global and local protein properties. It consists of the embedding output of the ESM-2 protein language model [lin2023evolutionary] and a downstream architecture similar to NetSurfP-3.0 [hoie2022netsurfp]. Additionally, a parameter-efficient fine-tuning strategy was implemented (Figure S1) [hu2021LoRA, pfeiffer2021adapterfusion]. The decoding head consists of a residual block with two convolutional neural network (CNN) layers and a two-layer bidirectional long short-term memory (BiLSTM) network. The output is fed into a fully connected layer to provide predictions for all residue- and protein-level tasks.
  • Figure 2: Assessment of hydrophobic patch (HP) predictions. (A) 154L chain A, a case example from the CB513 test set. PatchProt predictions are visualised in the style of NetSurfP-3.0. (B) Ground truth labels for the same protein structure, calculated with DSSP (for total hydrophobic surface area (THSA) and relative hydrophobic surface area (RHSA)) and MolPatch (for the largest HP).
  • Figure S1: Fine-tuning strategy with Low-Rank Adaptation (LoRA). To efficiently fine-tune the foundation model, we adopted recent advancements in parameter-efficient fine-tuning [hu2021LoRA]. In our approach, we applied LoRA to every linear layer within the original transformer architecture [vaswani2017attention], significantly reducing the number of updated parameters (to $2rd$ from the layer's original $d^2$). $W_0$ denotes the original weight matrix; its update is decomposed into two lower-rank matrices, $A$ and $B$, with dimensions $r \times d$ and $d \times r$ respectively (see the fine-tuning strategy section in the Supplementary Information).
  • Figure S2: Benchmarking global largest hydrophobic patch (LHP) predictions. The accuracy of the predicted surface area of the largest hydrophobic patch was compared using threshold curves. Global predictions by PatchProt are benchmarked against three other methods: the three-feature model (TFM), which uses the sequence length, the number of hydrophobic amino acids and the number of hydrophilic amino acids as input features [kuhn2008building]; the global feature model (GFM), trained on 31 global features using an XGBoost regressor [chen2016xgboost]; and the NetSurfP-2.0-based model (NBM), a random forest model trained on the relative and total hydrophobic surface area values (RHSA, THSA) predicted by NetSurfP-2.0, since the LHP cannot be calculated from NetSurfP-2.0 predictions directly [van2022sticky]. For each method, the fraction of proteins predicted within a given error margin is shown, calculated over the test set. The test set and the threshold curve calculations were replicated from the previous study [van2022sticky]. Importantly, a large fraction of the proteins in the test set were used to train PatchProt. For a fair comparison, the overlapping proteins were removed and the curves were calculated for the remaining proteins in the test set (n=346).
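
The parameter reduction quoted in the Figure S1 caption (from $d^2$ to $2rd$) can be illustrated with a minimal pure-Python sketch of a single LoRA-adapted linear layer. This is not the authors' implementation: the function names, the `scale` factor, the layer sizes, and the zero initialisation of $B$ (a common LoRA convention, not stated in the caption) are all assumptions made for illustration.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W0, A, B, x, scale=1.0):
    """Compute h = W0 x + scale * B (A x).

    W0 is the original d x d weight matrix; A (r x d) and B (d x r)
    are the low-rank factors. `scale` is a hypothetical scaling factor.
    """
    base = matvec(W0, x)              # original layer output, length d
    low = matvec(A, x)                # projection to rank r, length r
    update = matvec(B, low)           # projection back to d, length d
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d, r):
    """Parameter counts for a full d x d update versus the A and B factors."""
    return {"full": d * d, "lora": 2 * r * d}


# Illustrative sizes only; the caption does not give d or r.
d, r = 8, 2
rng = random.Random(0)
W0 = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]   # r x d
B = [[0.0] * r for _ in range(d)]                                # d x r, assumed zero init
x = [rng.gauss(0, 1) for _ in range(d)]

h = lora_forward(W0, A, B, x)
print(trainable_params(1024, 8))
```

With $B$ initialised to zero, the adapted layer initially reproduces the output of $W_0$, and only the $2rd$ entries of $A$ and $B$ would be updated during fine-tuning. For the hypothetical values $d = 1024$ and $r = 8$, that is 16,384 parameters instead of 1,048,576.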
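
The threshold curves described for Figure S2 report, for each error margin, the fraction of proteins whose predicted value falls within that margin of the ground truth. A minimal sketch of that calculation follows. The function name and the example values are hypothetical, and the use of absolute error is an assumption, since the caption does not say whether the margin is absolute or relative.

```python
def threshold_curve(y_true, y_pred, margins):
    """Return, for each margin, the fraction of samples with |true - pred| <= margin.

    Absolute error is an assumption of this sketch; the original study
    may define the error margin differently.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(errors)
    return [sum(e <= m for e in errors) / n for m in margins]


# Made-up values for illustration, not data from the paper.
true_lhp = [100.0, 250.0, 400.0, 800.0]
pred_lhp = [120.0, 240.0, 500.0, 790.0]
curve = threshold_curve(true_lhp, pred_lhp, margins=[10.0, 50.0, 100.0])
print(curve)  # [0.5, 0.75, 1.0]
```

A curve that rises faster at small margins indicates that more proteins are predicted close to their true value, which is how the methods in the caption are compared.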