A Pipeline for Data-Driven Learning of Topological Features with Applications to Protein Stability Prediction

Amish Mishra, Francis Motta

TL;DR

This work advances a data-driven pipeline that learns topological features from protein structure data to predict stability, demonstrating that topology-based features can approach the predictive power of expert-curated biophysical features. By combining persistent homology with CDER, the authors obtain compact, interpretable descriptors that capture loop and void structures across multiple homology dimensions, and they show these features correlate strongly with SME descriptors. The study shows that CDER features alone achieve 92–99% of SME performance and can offer modest improvements when fused with SME features for certain topologies, suggesting that topology can reveal complementary discriminative information. The approach provides a general TDA-ML pipeline for extracting informative, interpretable topological descriptors from biomolecular data, with potential applications beyond protein stability and toward broader structure-function analyses.

Abstract

In this paper, we propose a data-driven method to learn interpretable topological features of biomolecular data and demonstrate the efficacy of parsimonious models trained on topological features in predicting the stability of synthetic mini proteins. We compare models that leverage automatically learned structural features against models trained on a large set of biophysical features determined by subject-matter experts (SME). Our models, based only on topological features of the protein structures, achieved 92%-99% of the performance of SME-based models in terms of the average precision score. By interrogating model performance and feature importance metrics, we extract numerous insights that uncover high correlations between topological features and SME features. We further showcase how combining topological features and SME features can lead to improved model performance over either feature set used in isolation, suggesting that, in some settings, topological features may provide new discriminating information that is useful for protein stability prediction and is not captured by existing SME features.
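
The 92%-99% figure above is a ratio of average precision scores. As a minimal sketch of how such a ratio could be computed, the snippet below uses a hypothetical helper, `average_precision`, that assumes binary labels and no tied scores. The labels and scores are made-up illustration data, not results from the paper, and the paper's own evaluation code may differ.

```python
def average_precision(y_true, y_score):
    """Mean of precision@k taken at the rank k of each positive example.

    Assumes binary labels (1 = stable, 0 = unstable) and no tied scores.
    """
    # Rank examples from the highest to the lowest predicted score.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("average precision is undefined without positive labels")
    return sum(precisions) / len(precisions)


# Hypothetical toy data: six proteins, three of them labeled stable.
labels = [1, 0, 1, 1, 0, 0]
sme_scores = [0.9, 0.2, 0.8, 0.7, 0.3, 0.1]    # stand-in for an SME-based model
topo_scores = [0.9, 0.75, 0.8, 0.7, 0.3, 0.1]  # stand-in for a topology-based model

ap_sme = average_precision(labels, sme_scores)
ap_topo = average_precision(labels, topo_scores)

# Fraction of the SME model's average precision reached by the topological model.
relative = ap_topo / ap_sme
print(f"AP (SME) = {ap_sme:.3f}, AP (topology) = {ap_topo:.3f}, ratio = {relative:.1%}")
```

Reporting the ratio rather than the raw scores makes results comparable across protein topologies whose baseline difficulty differs.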

Paper Structure

This paper contains 18 sections, 8 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Toy example of the TDA-CDER pipeline. (A) Noisy point clouds sampled from a sphere (left) and a figure-8 (right). Orange triangles show structure added to the figure-8 by connecting points that lie within a given scale (0.58) of one another. (B) $H_1$ persistence diagram (PD) of the noisy figure-8 in (A), showing the topological holes and their persistence across scales. The point in the shaded region corresponds to the smaller hole, which has already formed at the scale (0.58) indicated in (A). The higher-persistence point to the right of the shaded region is "born" once the scale parameter grows enough to close the open loop in the noisy figure-8. At a large enough scale, the small hole "dies" as it is filled in. The persistence of a topological feature is the difference between its death and birth scales. (C) $H_1$ PDs of 50 randomly sampled noisy spheres and 50 randomly sampled noisy figure-8s. (D) Features of the PDs that CDER learned to discriminate between spheres and figure-8s. CDER ignores regions whose points are common to both spheres and figure-8s.
  • Figure 2: Histograms of the distribution of stability scores by topology. The numbers in the upper left and upper right of each frame give the number of stability scores that are less than and greater than 1, respectively.
  • Figure 3: The atomic coordinates of a sample protein with EEHEE secondary structure topology are on the left, $H_0$, $H_1$, and $H_2$ persistence diagrams are in the middle, and the transformed diagrams are on the right. The transformed $H_0$ diagram has points distributed along one dimension because all points have the same birth. The units on all axes are angstroms (Å).
  • Figure 4: Hexbin plot of $H_1$ persistence pairs from the transformed persistence diagrams of all proteins with secondary structure EEHEE. Each hexagon is colored according to the difference between the number of points in that region from proteins labeled stable (green) and the number from proteins labeled unstable (red). The color scale is logarithmic. Deep red corresponds to a negative difference, i.e., points from persistence diagrams of unstable proteins outnumber points from stable proteins in that region. Similarly, deep green signifies that points from stable proteins outnumber points from unstable proteins. Yellow regions contain equal numbers of points from stable and unstable proteins.
  • Figure 5: Top 10 feature importances in ascending order of mean decrease in impurity when training a random forest classifier on all downsampled proteins for each topology. There was no train/test split done in generating this figure. (For full descriptions of the SME features, see Table \ref{tab:sme_features} in Section \ref{sec:sme_feat_descrip}.)
  • ...and 4 more figures
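
The captions of Figures 1 and 3 rely on two facts: persistence is death minus birth, and every $H_0$ class shares the same birth scale. The sketch below illustrates both on a toy point cloud using a union-find pass over sorted edges. It rests on several assumptions that the captions do not confirm: a Vietoris-Rips filtration in which an edge enters at the Euclidean distance between its endpoints, a transform that maps (birth, death) to (birth, persistence), and hypothetical helper names and coordinates that are not taken from the paper or from any protein.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return (birth, death) pairs for the finite H_0 classes of a point cloud.

    Assumption: every point is born at scale 0, and an edge appears at the
    Euclidean distance between its endpoints. The one class that never dies
    (the final connected component) is omitted.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge at this scale, so one H_0 class dies here.
            parent[root_i] = root_j
            pairs.append((0.0, length))
    return pairs


def to_birth_persistence(pairs):
    """Assumed transform: (birth, death) -> (birth, death - birth)."""
    return [(birth, death - birth) for birth, death in pairs]


# Hypothetical coordinates chosen only for illustration.
toy_points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
diagram = h0_persistence(toy_points)
transformed = to_birth_persistence(diagram)
print(diagram)
print(transformed)
```

Because every birth is 0 in this setting, the transformed $H_0$ points all lie on a single vertical line, which matches the one-dimensional spread described in the Figure 3 caption. Higher-dimensional diagrams ($H_1$, $H_2$) need a full persistent homology computation that this sketch does not attempt.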

Theorems & Definitions (3)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3