Computational design of target-specific linear peptide binders with TransformerBeta

Haowen Zhao, Francesco A. Aprile, Barbara Bravi

TL;DR

An unprecedentedly large-scale library of peptide pairs within stable secondary structures (beta sheets) is built and a machine learning method based on the Transformer architecture is developed for the design of specific linear binders, in analogy to a language translation task.

Abstract

The computational prediction and design of peptide binders targeting specific linear epitopes is crucial in biological and biomedical research, yet it remains challenging due to their highly dynamic nature and the scarcity of experimentally solved binding data. To address this problem, we built an unprecedentedly large-scale library of peptide pairs within stable secondary structures (beta sheets), leveraging newly available AlphaFold-predicted structures. We then developed a machine learning method based on the Transformer architecture for the design of specific linear binders, in analogy to a language translation task. Our method, TransformerBeta, accurately predicts specific beta strand interactions and samples sequences with beta sheet-like molecular properties, while capturing interpretable physico-chemical interaction patterns. As such, it can propose specific candidate binders targeting linear epitopes for experimental validation to inform protein design.

Paper Structure

This paper contains 22 sections, 12 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: AlphaFold 2 Beta Strand Database. (A) Illustration of the collection of high-confidence beta strand pairs from the AlphaFold Protein Structure Database. Example structures (viewed with the Mol* viewer; Sehnal et al., 2021) show two high-confidence pairs (anti-parallel and parallel) and one pair that did not meet the high-confidence criteria. (B) Peptide length distribution of beta strand pairs, with the overall proportion of anti-parallel (74.9%) to parallel (25.1%) pairs. (C) Average number of potential binders available for each distinct target sequence. For clarity of visualization, lengths < 6 are plotted as an inset. (D) Pairwise dissimilarity distribution for anti-parallel pairs of length 8 (a subset of 1,933,932 pairs), measured with the normalized Hamming distance. In (B) and (C), pairs longer than 20 are grouped together for clarity.
  • Figure 2: Strategy for designing binders to target a linear epitope using TransformerBeta. The target protein is shown in silver, with the epitope of interest highlighted in red. TransformerBeta takes as input the target sequence (from N to C terminus) and generates diverse target-specific linear peptides that are putative binders. The putative bound structure is a simulated docking pose obtained with HPEPDOCK (Zhou et al., 2018), and all protein structures are viewed with the Mol* viewer (Sehnal et al., 2021).
  • Figure 3: Model's prediction accuracy. (A) Distribution of the TransformerBeta predicted probability assigned to binders in the test, shuffled, and random sets, each containing 107,441 sequence pairs. (B) Receiver Operating Characteristic (ROC) curve and corresponding Area Under the Curve (ROC-AUC). (C) Precision-Recall (PR) curve and corresponding Area Under the Curve (PR-AUC). The dashed line gives the performance of a random classifier (ROC-AUC = PR-AUC = 0.50). (D, E) ROC-AUC as a function of the closest Hamming distance between test and training targets (D) and between test and training binders (E). Hamming distances of 4, which have fewer than 5 data points, are grouped with distance 3.
  • Figure 4: Properties of generated data. (A) t-SNE projected distributions of 5,000 randomly sampled binders from the natural set, the model-generated set, and the random set. (B) Cumulative Distribution Function (CDF) of various physicochemical properties (net charge, hydrophobicity, molecular weight, isoelectric point, aromaticity) for the same natural, model-generated, and random binders as in (A).
  • Figure 5: Interpretability of TransformerBeta. (A) Input embedding shared across encoder and decoder. (B) Scatter plot comparing embedding cosine similarity scores and BLOSUM62 substitution scores (see Supp. Methods). (C) Average cross-attention map. $X_i$ and $Y_i$ represent the $i$-th amino acid of the target and the binder, respectively. <bos> and <eos> are two special tokens.
  • ...and 5 more figures
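
The captions of Figures 1(D) and 3(D, E) rely on the normalized Hamming distance between equal-length sequences and on the closest Hamming distance between a test sequence and the training set. The following is a minimal sketch of those two quantities, not the authors' implementation: the function names and the example peptides are hypothetical, and restricting the comparison to references of the same length is an assumption made here for illustration.

```python
from typing import Iterable


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def normalized_hamming(a: str, b: str) -> float:
    """Hamming distance divided by the sequence length, in [0, 1]."""
    if not a:
        raise ValueError("sequences must be non-empty")
    return hamming(a, b) / len(a)


def closest_hamming(query: str, references: Iterable[str]) -> int:
    """Smallest Hamming distance from `query` to any same-length reference.

    Assumption: references of a different length are skipped.
    """
    distances = [hamming(query, r) for r in references if len(r) == len(query)]
    if not distances:
        raise ValueError("no reference of matching length")
    return min(distances)


# Hypothetical length-8 peptides, chosen only to illustrate the arithmetic.
test_seq = "VKVTVEVR"
training = ["VKITVEAR", "TRVEVKVE", "AAAA"]

print(normalized_hamming(test_seq, training[0]))  # 2 mismatches / 8 positions
print(closest_hamming(test_seq, training))
```

A closest distance of 0 would mean the test sequence also appears in the training set, so larger values correspond to test sequences that are further from anything seen during training.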