PepGB: Facilitating peptide drug discovery via graph neural networks

Yipin Lei, Xu Wang, Meng Fang, Han Li, Xiang Li, Jianyang Zeng

TL;DR

PepGB addresses the bottlenecks of peptide drug discovery by predicting peptide-protein interactions on a heterogeneous graph, leveraging graph attention networks with a DropMessage perturbation and a dual-view loss to mitigate overfitting and data imbalance. A contrastive pre-training strategy enables robust peptide representations from a large unlabeled sequence corpus, improving generalization to novel targets and hits. To tackle imbalanced lead-generation data, diPepGB introduces directed edges encoding relative binding strength, enabling effective modeling in realistic assay conditions and supporting virtual alanine scanning. Across rigorous, cluster-based validations, PepGB shows superior performance over baselines in novel settings, while diPepGB demonstrates strong performance on imbalanced data and real-world lead optimization tasks, underscoring its potential to accelerate peptide early drug discovery.

Abstract

Peptides offer great biomedical potential and serve as promising drug candidates. Currently, the majority of approved peptide drugs are directly derived from well-explored natural human peptides. Advanced deep learning techniques are needed to identify novel peptide drugs in the vast, unexplored biochemical space. Although various in silico methods have been developed to accelerate peptide early drug discovery, existing models suffer from overfitting and poor generalizability due to the limited size, imbalanced distribution and inconsistent quality of experimental data. In this study, we propose PepGB, a deep learning framework that facilitates peptide early drug discovery by predicting peptide-protein interactions (PepPIs). Built on graph neural networks, PepGB combines a fine-grained perturbation module and a dual-view objective with peptide representations pre-trained via contrastive learning to predict PepPIs. Through rigorous evaluations, we demonstrate that PepGB greatly outperforms baselines and can accurately identify PepPIs for novel targets and peptide hits, thereby contributing to the target identification and hit discovery processes. Next, we derive an extended version, diPepGB, to tackle the bottleneck of modeling the highly imbalanced data prevalent in lead generation and optimization. By using directed edges to represent the relative binding strength between two peptide nodes, diPepGB achieves superior performance in real-world assays. In summary, our proposed frameworks can serve as potent tools to facilitate peptide early drug discovery.

Paper Structure (29 sections, 7 equations, 6 figures, 3 tables)

This paper contains 29 sections, 7 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of PepGB. A The motivation of our proposed framework is to address empirical challenges in drug discovery: only limited experimental interaction data are available; popular targets or peptides are measured more frequently, and the remaining unknown interactions do not always represent negatives; and binding labels are inconsistent due to batch effects and systematic errors. B PepGB is a heterogeneous graph-based framework to predict PepPIs. Pre-trained sequence embeddings serve as node features. Protein-protein interactions among existing protein nodes are added to provide additional message passing. C PepGB uses a graph attention network (GAT) to update node features by aggregating information from neighboring nodes. To avoid overfitting and improve generalizability, the DropMessage module randomly applies dropout to each element of the message passing matrix. D The tailor-made contrastive learning-based pre-training strategy aims to learn peptide representations from a large-scale peptide sequence database.
  • Figure 2: An illustration of the framework of diPepGB. A diPepGB aims to address the imbalanced nature of experimental data, which leads to an uneven local topology on the PepPI graph. The "firework"-style sub-graph can be reformulated as a directed graph by constructing pairwise comparisons of peptide binding strength. To keep diPepGB robust against systematic errors, we define error-tolerant directed edges that originate from peptide nodes with significantly stronger affinities and point to peptide nodes with weaker affinities. B We illustrate the application of diPepGB in lead optimization using binding assays of peptide analogs binding to the oncoprotein MDM2.
  • Figure 3: The validation settings to evaluate PepGB. A Computational models commonly adopt the random-split setting for performance evaluation, leading to over-optimistic results. To mimic more realistic scenarios, nodes are first clustered into groups and then different groups are assigned to the training and test sets, respectively. B The "novel protein setting" guarantees the training edges and test edges do not share similar protein nodes. C The "novel peptide setting" guarantees the training edges and test edges do not share similar peptide nodes. D The "novel pair setting" guarantees the training edges and test edges neither share similar peptide nodes nor share similar protein nodes.
  • Figure 4: Performance of PepGB and other baselines on binary PepPI prediction. A and B show the AUC and AUPR scores of PepGB and six baseline methods under three cross-validation settings, respectively.
  • Figure 5: Performance of diPepGB on PMI peptide analogs that bind to the anti-tumor target MDM2. A The data splitting strategy and validation setting of diPepGB. B The AUC and AUPR scores of diPepGB and two baselines. diPepGB outperformed both baselines in terms of both AUC and AUPR.
  • ...and 1 more figures
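
The Figure 2 caption describes directed edges that run from a peptide with significantly stronger affinity to one with weaker affinity, with a tolerance for measurement error. A minimal Python sketch of that pairwise construction follows. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_directed_edges`, the `tolerance` parameter and its value, the peptide identifiers, and the affinity numbers (on a scale where larger means stronger binding) are all hypothetical.

```python
from itertools import combinations


def build_directed_edges(affinities, tolerance):
    """Build (stronger, weaker) edges between peptides binding the same target.

    affinities: dict mapping a peptide id to a binding strength, assumed to be
        on a scale where a larger value means stronger binding.
    tolerance: assumed minimum gap for a difference to count as significant;
        pairs with a smaller gap get no edge in either direction.
    """
    edges = []
    for a, b in combinations(sorted(affinities), 2):
        gap = affinities[a] - affinities[b]
        if gap > tolerance:
            edges.append((a, b))  # a binds significantly more strongly than b
        elif gap < -tolerance:
            edges.append((b, a))  # b binds significantly more strongly than a
    return edges


# Hypothetical affinities for three analogs of one target (not from the paper).
affinities = {"pep_A": 8.5, "pep_B": 8.4, "pep_C": 6.0}
edges = build_directed_edges(affinities, tolerance=0.5)
print(edges)
```

In this toy input, `pep_A` and `pep_B` differ by less than the tolerance, so no edge connects them, while both point to the clearly weaker `pep_C`. Leaving near-ties unconnected is one simple way to keep small assay-to-assay fluctuations from producing contradictory edge labels.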