Towards a Benchmark for Markov State Models: The Folding of HP35

Daniel Nagel, Sofia Sartore, Gerhard Stock

TL;DR

This work addresses the challenge of establishing robust benchmarks for Markov state models (MSMs) of biomolecular dynamics by analyzing the HP35 folding trajectory. It builds a reference contact-based MSM using a $300\,\mu$s MD trajectory, and tests robustness by varying feature sets, dimensionality reduction, clustering, and temperature, revealing a hierarchical free-energy landscape with metastable substates in both native and unfolded basins. The study shows that several reasonable MSM workflows yield comparable implied timescales while differing in the partitioning of the landscape, underscoring the crucial role of feature selection and the potential pitfalls of mis-specified coordinates. Overall, HP35 serves as a practical benchmark for MSM methodology, providing insights into how to interpret state-based models and guiding future development of robust, interpretable coarse-grained descriptions of folding dynamics.

Abstract

Adopting a $300\,\mu$s-long molecular dynamics (MD) trajectory of the reversible folding of villin headpiece (HP35) published by D. E. Shaw Research, we recently constructed a Markov state model (MSM) of the folding process based on interresidue contacts [J. Chem. Theory Comput. 2023, $\mathbf{19}$, 3391]. The model reproduces the MD folding times of the system and predicts that both the native basin and the unfolded region of the free energy landscape are partitioned into several metastable substates that are structurally well characterized. Recognizing the need to establish well-defined but nontrivial benchmark problems, in this Perspective we study to what extent and in what sense this MSM may be employed as a reference model. To this end, we test the robustness of the MSM by comparing it to models that use alternative combinations of features, dimensionality reduction methods and clustering schemes. The study suggests some main characteristics of the folding of HP35, which should be reproduced by any other competitive model of the system. Moreover, the discussion reveals which parts of the MSM workflow matter most for the considered problem, and illustrates the promises and possible pitfalls of state-based models for the interpretation of biomolecular simulations.

Paper Structure (9 sections, 1 equation, 5 figures)

Figures (5)

  • Figure 1: The folding of villin headpiece (HP35). (a) Molecular structure of the native state, consisting of three $\alpha$-helices (residues 3--10, 14--19 and 22--32) connected by two loops. (b) Illustration of the seven main clusters of native contacts, obtained from a MoSAIC correlation analysis [Diez22] of the contact distances. (c) Time evolution of the fraction of native contacts $Q$ obtained from the folding trajectory by Piana et al. [Piana12], showing reversible transitions between the unfolded region of the free energy landscape ($Q \lesssim 0.4$) and the native basin ($Q \gtrsim 0.7$).
  • Figure 2: Contact-based model of HP35, using principal component analysis (PCA), robust density-based clustering (RDC) and the most probable path algorithm (MPP). (a) MPP dendrogram illustrating the clustering of microstates into metastable states upon increasing the requested minimum metastability criterion $Q_\text{min}$ of a state. (b) Structural characterization of the twelve metastable states of HP35. The states are ordered by decreasing fraction of native contacts $Q$, and the contacts are ordered according to the seven main clusters (Fig. 1b). (c) First three implied timescales $t_n$ shown as a function of the lag time $\tau_{\text{lag}}$ and box-plot representation of the folding time distributions obtained from MSM and MD data. (d) Kinetic network of the MSM generated from the force-directed algorithm ForceAtlas2 [ForceAtlas2]. The node size indicates the population of the state and the color code reflects its fraction of native contacts $Q$.
  • Figure 3: (a) Effects of using G-PCCA instead of MPP for the construction of macrostates. The Sankey diagrams compare the states obtained for the reference model (PCA/RDC/MPP, on the left) and the alternative model (PCA/RDC/G-PCCA, on the right). (b) Performance of the method combination tICA/$k$-means/MPP as revealed by the Sankey plot comparison to the reference model and (c) the MPP dendrogram.
  • Figure 4: Effects of the choice of features, including (a) Cartesian coordinates, (b) selected C$_\alpha$-distances, and (c) contacts combined with the backbone dihedral angle $\phi_3$. Shown are the Sankey plots for comparison to the reference states and the MPP dendrogram in case (a).
  • Figure 5: Comparison of various methods and features, showing (top) the normalized entropy score $H$, (middle) the Davies-Bouldin index $\Delta\mathrm{DBI} = \mathrm{DBI}_{\max} - \mathrm{DBI}$, and (bottom) the timescale score GMRQ, all of which should be maximized.
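
The implied timescales $t_n$ shown as a function of $\tau_{\text{lag}}$ in Figure 2c follow the standard MSM relation $t_n = -\tau_{\text{lag}} / \ln \lambda_n$, where $\lambda_n$ are the eigenvalues of the transition matrix estimated at lag time $\tau_{\text{lag}}$. The following is a minimal NumPy sketch of that relation, not the authors' workflow: the function names `transition_matrix` and `implied_timescales` are hypothetical, and symmetrizing the count matrix is only one simple way to enforce detailed balance, assumed here for illustration.

```python
import numpy as np


def transition_matrix(states, n_states, lag):
    """Estimate a row-stochastic transition matrix from a state trajectory.

    `states` is a 1D integer array of state labels in [0, n_states).
    Assumes every state is visited at least once, so no row is empty.
    """
    states = np.asarray(states)
    counts = np.zeros((n_states, n_states))
    # Count transitions i -> j separated by `lag` frames.
    np.add.at(counts, (states[:-lag], states[lag:]), 1.0)
    # Assumption: symmetrize the counts to enforce detailed balance.
    counts = counts + counts.T
    return counts / counts.sum(axis=1, keepdims=True)


def implied_timescales(T, lag, n=3):
    """Return the first n implied timescales t_n = -lag / ln(lambda_n).

    The stationary eigenvalue (lambda = 1) is skipped. Eigenvalues that
    are not positive yield NaN, since their logarithm is undefined.
    """
    eigvals = np.sort(np.linalg.eigvals(T).real)[::-1]
    return -lag / np.log(eigvals[1:n + 1])


if __name__ == "__main__":
    # Toy two-state trajectory; timescales are in units of frames.
    toy = np.array([0, 0, 1, 1, 0, 0, 1, 1])
    T = transition_matrix(toy, n_states=2, lag=1)
    print(implied_timescales(T, lag=1, n=1))
```

Repeating the calculation for a range of lag times and checking where $t_n$ levels off is the usual way to choose $\tau_{\text{lag}}$; the timescales come out in the same units as the lag time.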