Topology-enhanced machine learning model (Top-ML) for anticancer peptide prediction

Joshua Zhi En Tan, JunJie Wee, Xue Gong, Kelin Xia

TL;DR

Top-ML addresses the featurization bottleneck in anticancer peptide (ACP) prediction by integrating topology-driven representations with sequence-based encodings. It combines the natural vector, the 420-dimensional Magnus vector, terminus composition features, and spectral representations from sequence-based Laplacians as inputs to an Extra Trees classifier, achieving state-of-the-art or competitive results on AntiCP 2.0 and mACPpred 2.0 while offering improved interpretability. Key findings highlight the predictive value of the mean index-position spectral features and of the Magnus representation with a window of length 5. Feature-importance analysis reveals biologically plausible signals such as amino acid clustering and glutamic acid distribution. The framework can be applied to other sequential data and provides a computationally efficient alternative to deep learning for ACP screening. Suggested future work includes incorporating 3D structural information and extending the model to modified peptides.

Abstract

Recently, therapeutic peptides have demonstrated great promise for cancer treatment. To explore powerful anticancer peptides, artificial intelligence (AI)-based approaches have been developed to systematically screen potential candidates. However, the lack of efficient featurization of peptides has become a bottleneck for these machine-learning models. In this paper, we propose a topology-enhanced machine learning model (Top-ML) for anticancer peptide prediction. Top-ML employs topological features of a peptide derived from its sequence "connection" information, characterized by vector and spectral descriptors. The model uses an Extra-Trees classifier and has been validated on the AntiCP 2.0 and mACPpred 2.0 benchmark datasets, where it achieves state-of-the-art performance or results comparable to existing deep learning models while providing greater interpretability. Our results highlight the potential of novel topology-based featurization to accelerate the identification of anticancer peptides.
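
To make the "vector descriptors" more concrete, the following is a minimal sketch of a natural vector, one of the descriptors named in the TL;DR. It assumes the commonly used definition: for each of the 20 standard amino acids, the count, the mean 1-based position, and the normalized second central moment of its positions, giving 60 numbers. The exact variant used by Top-ML is not stated in this summary, so the function name `natural_vector`, the ordering of the output, and the handling of absent or non-standard residues are illustrative assumptions.

```python
# Sketch of a 60-dimensional natural vector for a peptide sequence.
# Assumption: standard definition (count, mean position, normalized
# second central moment per amino acid); not confirmed as Top-ML's variant.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def natural_vector(seq):
    """Return [counts | mean positions | second moments], 20 entries each."""
    n = len(seq)
    # Collect the 1-based positions at which each residue occurs.
    positions = {aa: [] for aa in AMINO_ACIDS}
    for i, aa in enumerate(seq, start=1):
        if aa in positions:  # non-standard letters are ignored (assumption)
            positions[aa].append(i)

    counts, means, moments = [], [], []
    for aa in AMINO_ACIDS:
        pos = positions[aa]
        n_k = len(pos)
        if n_k == 0:
            # Absent residues contribute zeros (assumption).
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(pos) / n_k
        # Second central moment, normalized by n_k * n.
        d2_k = sum((p - mu_k) ** 2 for p in pos) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


vec = natural_vector("AAC")
print(len(vec), vec[0], vec[20], vec[40])
```

In this layout, entries 0 to 19 hold counts, 20 to 39 hold mean positions, and 40 to 59 hold second moments, so a feature-importance peak at a given index can be traced back to a specific residue and statistic.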

Paper Structure

This paper contains 6 sections, 6 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of the transformation of (a) the peptide sequence "ACDE" into (b) a sequence-based Laplacian $\hat{\mathbf{L}}_0$ and (c) its corresponding symmetric normalized Laplacian matrix $\tilde{\mathbf{L}}_0$ based on the mean index positions of the amino acids. The largest off-diagonal entry in (b) $\hat{\mathbf{L}}_0$ in terms of absolute value corresponds to the amino acid pair (D, E) since D and E have the highest mean index positions. After normalization, the diagonal entries in (c) $\tilde{\mathbf{L}}_0$ become ones, and the off-diagonal entries take values between -1 and 0.
  • Figure 2: Pipeline of the topology-enhanced machine learning model (Top-ML) for ACP prediction. (A) Peptide sequences are taken from AntiCP 2.0 (Dataset A or B). (B) Peptide spectral features are obtained from the sequence-based Laplacian matrices. (C) Magnus vectors, natural vectors, and terminal composition features are generated from the peptide sequences. (D) The combined features serve as inputs for the classification of ACPs and non-ACPs.
  • Figure 3: Feature importance of the Top-ML model, averaged over 100 iterations of training using the mACPpred 2.0 training set. Feature importance is computed using the Gini importance criterion. We identified the three highest peaks in the top plot, highlighted with markers, and plotted their value distributions for each class in the bottom plot.
  • Figure 4: Illustration of boundary and combinatorial Laplacian matrices for an oriented simplicial complex.
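
The normalization step described in the Figure 1 caption can be illustrated with a small sketch. It builds a weighted graph Laplacian for the sequence "ACDE" and applies the standard symmetric normalization $\tilde{\mathbf{L}} = \mathbf{D}^{-1/2}\hat{\mathbf{L}}\mathbf{D}^{-1/2}$, where $\mathbf{D}$ is the diagonal of $\hat{\mathbf{L}}$. The edge weight used here, the product of the two mean index positions, is a hypothetical stand-in chosen only so that the pair (D, E) receives the largest weight, as in the caption; the paper's actual weight definition is not given in this summary. All function names are illustrative.

```python
import math

# Sketch: weighted Laplacian and its symmetric normalization for "ACDE".
# Assumption: w_ij = m_i * m_j (product of mean index positions) is a
# hypothetical weight, NOT the definition used in the Top-ML paper.


def mean_index_positions(seq):
    """Mean 1-based position of each distinct residue in the sequence."""
    positions = {}
    for i, aa in enumerate(seq, start=1):
        positions.setdefault(aa, []).append(i)
    return {aa: sum(p) / len(p) for aa, p in positions.items()}


def laplacian_from_weights(weights):
    """Combinatorial Laplacian L = D - W of a symmetric weight matrix."""
    n = len(weights)
    lap = [[-weights[i][j] if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    for i in range(n):
        lap[i][i] = sum(weights[i][j] for j in range(n) if j != i)
    return lap


def symmetric_normalize(lap):
    """D^{-1/2} L D^{-1/2}; assumes every vertex has positive degree."""
    n = len(lap)
    deg = [lap[i][i] for i in range(n)]
    return [[lap[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


seq = "ACDE"
means = mean_index_positions(seq)
labels = sorted(means)  # vertex order: A, C, D, E
weights = [[means[a] * means[b] if a != b else 0.0 for b in labels]
           for a in labels]
L_hat = laplacian_from_weights(weights)
L_tilde = symmetric_normalize(L_hat)

for row in L_tilde:
    print([round(x, 3) for x in row])
```

Whatever weights are chosen, this normalization yields ones on the diagonal and off-diagonal entries in $[-1, 0]$, because each weight $w_{ij}$ is at most $\sqrt{d_i d_j}$ when the degrees are sums of non-negative weights. This matches the behavior described in the caption.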