Towards a Benchmark for Markov State Models: The Folding of HP35
Daniel Nagel, Sofia Sartore, Gerhard Stock
TL;DR
This work addresses the challenge of establishing robust benchmarks for Markov state models (MSMs) of biomolecular dynamics by analyzing the HP35 folding trajectory. It builds a reference contact-based MSM using a $300\,\mu$s MD trajectory, and tests robustness by varying feature sets, dimensionality reduction, clustering, and temperature, revealing a hierarchical free-energy landscape with metastable substates in both native and unfolded basins. The study shows that several reasonable MSM workflows yield comparable implied timescales while differing in the partitioning of the landscape, underscoring the crucial role of feature selection and the potential pitfalls of mis-specified coordinates. Overall, HP35 serves as a practical benchmark for MSM methodology, providing insights into how to interpret state-based models and guiding future development of robust, interpretable coarse-grained descriptions of folding dynamics.
Abstract
Adopting a $300\,\mu$s-long molecular dynamics (MD) trajectory of the reversible folding of villin headpiece (HP35) published by D. E. Shaw Research, we recently constructed a Markov state model (MSM) of the folding process based on interresidue contacts [J. Chem. Theory Comput. 2023, $\mathbf{19}$, 3391]. The model reproduces the MD folding times of the system and predicts that both the native basin and the unfolded region of the free energy landscape are partitioned into several metastable substates that are structurally well characterized. Recognizing the need to establish well-defined but nontrivial benchmark problems, in this Perspective we study to what extent and in what sense this MSM may be employed as a reference model. To this end, we test the robustness of the MSM by comparing it to models that use alternative combinations of features, dimensionality reduction methods and clustering schemes. The study suggests some main characteristics of the folding of HP35, which should be reproduced by any other competitive model of the system. Moreover, the discussion reveals which parts of the MSM workflow matter most for the considered problem, and illustrates the promises and possible pitfalls of state-based models for the interpretation of biomolecular simulations.
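
To make the last step of such a workflow concrete, the following minimal sketch shows how a discretized state trajectory could be turned into a transition matrix and implied timescales, $t_i = -\tau_{\rm lag} / \ln \lambda_i$. It is an illustration only and not the procedure used in this work: the function names `transition_matrix` and `implied_timescales`, the sliding-window counting, and the simple symmetrization of the count matrix are assumptions made for this example.

```python
import numpy as np


def transition_matrix(states, n_states, lag):
    """Row-normalized transition matrix from a discrete state trajectory.

    Assumption for this sketch: counts are symmetrized to enforce
    detailed balance, which presumes equilibrium sampling.
    """
    states = np.asarray(states, dtype=int)
    counts = np.zeros((n_states, n_states))
    # Count transitions i -> j separated by `lag` frames (sliding window).
    np.add.at(counts, (states[:-lag], states[lag:]), 1.0)
    counts = counts + counts.T
    # Every state must be visited, otherwise a row sum is zero.
    return counts / counts.sum(axis=1, keepdims=True)


def implied_timescales(T, lag, dt=1.0):
    """Implied timescales t_i = -lag*dt / ln(lambda_i), slowest first."""
    eigvals = np.sort(np.linalg.eigvals(T).real)[::-1]
    # Drop the stationary eigenvalue (1) and any non-positive ones.
    lam = eigvals[1:]
    lam = lam[(lam > 0.0) & (lam < 1.0)]
    return -lag * dt / np.log(lam)
```

Comparing the timescales returned for different lag times, or for state trajectories obtained from different features and clustering schemes, is one simple way to check whether alternative workflows agree on the slow dynamics.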
