A Pipeline for Data-Driven Learning of Topological Features with Applications to Protein Stability Prediction
Amish Mishra, Francis Motta
TL;DR
This work presents a data-driven pipeline that learns topological features from protein structure data to predict stability, and shows that topology-based features can approach the predictive power of features curated by subject-matter experts (SME). By combining persistent homology with CDER (Cover-Tree Differencing via Entropy Reduction), the authors obtain compact, interpretable descriptors that capture loop and void structures across multiple homology dimensions, and they show that these descriptors correlate strongly with SME features. Models trained on CDER features alone achieve 92% to 99% of the average precision of SME-based models, and fusing CDER features with SME features yields modest improvements for certain topologies, suggesting that topological features can carry complementary discriminating information. The approach provides a general TDA-ML pipeline for extracting informative, interpretable topological descriptors from biomolecular data, with potential applications beyond protein stability, including broader structure-function analyses.
Abstract
In this paper, we propose a data-driven method to learn interpretable topological features of biomolecular data and demonstrate the efficacy of parsimonious models trained on these features in predicting the stability of synthetic mini proteins. We compare models that leverage automatically learned structural features against models trained on a large set of biophysical features determined by subject-matter experts (SME). Our models, based only on topological features of the protein structures, achieved 92% to 99% of the performance of SME-based models as measured by the average precision score. By interrogating model performance and feature importance metrics, we uncover high correlations between topological features and SME features. We further show that combining topological features and SME features can improve model performance over either feature set used in isolation, suggesting that, in some settings, topological features may provide new discriminating information that is useful for protein stability prediction and is not captured by existing SME features.
