Determining the optimal structural resolution of proteins through an information-theoretic analysis of their conformational ensemble

Margherita Mele, Raffaele Fiorentini, Thomas Tarenzi, Giovanni Mattiotti, Raffaello Potestio

TL;DR

This work introduces PROPRE, an unsupervised information-theoretic framework to determine the minimal structural detail needed to faithfully describe a protein's conformational space. By systematically decimating heavy atoms, clustering MD frames with an RSD-based distance, and evaluating resolution and relevance via CVS, PROPRE identifies an optimal number of retained atoms $N_{\mathrm{OPT}}$ that maximizes informative content while minimizing detail. Across 11 diverse proteins, $N_{\mathrm{OPT}}$ scales linearly with system size, averaging about four heavy atoms per residue, a regime consistent with coarse-grained models like MARTINI and SIRAH; the optimal resolution also depends on the extent of conformational exploration. The framework provides a principled link between atomistic detail and coarse-grained representation, guiding multiscale modeling and offering insight into the structure–dynamics–function relationship in proteins, with data and code freely available for reproducibility.

Abstract

The choice of structural resolution is a fundamental aspect of protein modelling, determining the balance between descriptive power and interpretability. Although atomistic simulations provide maximal detail, much of this information is redundant for understanding the relevant large-scale motions and conformational states. Here, we introduce an unsupervised, information-theoretic framework that determines the minimal number of atoms required to retain a maximally informative description of the configurational space sampled by a protein. This framework quantifies the informativeness of coarse-grained representations obtained by systematically decimating atomic degrees of freedom and evaluating the resulting clustering of sampled conformations. Application to molecular dynamics trajectories of dynamically diverse proteins shows that the optimal number of retained atoms scales linearly with system size, averaging about four heavy atoms per residue, remarkably consistent with the resolution of well-established coarse-grained models such as MARTINI and SIRAH. Furthermore, the analysis shows that the optimal number of retained atoms depends not only on molecular size but also on the extent of conformational exploration, decreasing for systems dominated by collective motions. The proposed method establishes a general criterion to identify the minimal structural detail that preserves the essential configurational information, thereby offering a new viewpoint on the structure-dynamics-function relationship in proteins and guiding the construction of parsimonious yet informative multiscale models.
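The abstract does not spell out how the informativeness of a clustering is scored. As a minimal sketch, assume the resolution and relevance are the empirical entropies commonly used in the resolution-relevance framework (an assumption, since this excerpt gives no definitions): with $M$ frames, $k_s$ frames in cluster $s$, and $m_k$ clusters containing exactly $k$ frames, $H_{\mathrm{res}} = -\sum_s (k_s/M)\ln(k_s/M)$ and $H_{\mathrm{rel}} = -\sum_k (k m_k/M)\ln(k m_k/M)$. The function name `resolution_relevance` is hypothetical and is not taken from the PROPRE code.

```python
import math
from collections import Counter


def resolution_relevance(labels):
    """Return (H_res, H_rel) in nats for a sequence of cluster labels.

    ASSUMPTION: H_res and H_rel are the empirical entropies of the cluster
    labels and of the cluster sizes, respectively; the source excerpt does
    not define them explicitly.
    """
    M = len(labels)
    if M == 0:
        raise ValueError("labels must be non-empty")

    sizes = Counter(labels)                 # k_s: number of frames in cluster s
    size_counts = Counter(sizes.values())   # m_k: number of clusters with k frames

    # Resolution: entropy of the distribution of frames over clusters.
    h_res = -sum((k / M) * math.log(k / M) for k in sizes.values())
    # Relevance: entropy of the distribution of frames over cluster sizes.
    h_rel = -sum((k * m / M) * math.log(k * m / M) for k, m in size_counts.items())
    return h_res, h_rel
```

For example, the labels `[0, 0, 1, 2]` give $H_{\mathrm{res}} = 1.5\ln 2$ and $H_{\mathrm{rel}} = \ln 2$. Both quantities vanish when all frames fall in one cluster, and $H_{\mathrm{rel}}$ also vanishes when every frame is its own cluster, which is why an intermediate resolution can be the most informative.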

Paper Structure (17 sections, 8 equations, 8 figures)

Figures (8)

  • Figure 1: Example of different decimation mappings of a protein. Panel (a) shows a protein with $N_{ha} = 2027$ heavy atoms. Panels (b), (c), and (d) show three different decimation mappings of the same protein, each selecting a subset of $N_{CG} = 100$ atoms out of $N_{ha}$.
  • Figure 2: Infographic that illustrates the steps of the first part of the PROPRE protocol: the construction of the $H_{\mathrm{res}},H_{\mathrm{rel}}$ scatter plot. After filtering the atomistic trajectory (by choosing a random decimation mapping of the heavy atoms), the RSD map is built, in order to perform a clustering of the filtered configurations. Each mapping corresponds to a point in the $H_{\mathrm{res}},H_{\mathrm{rel}}$ plane; by varying the retained atoms selection as well as the number of retained atoms one can reconstruct the curve as a scatter plot.
  • Figure 3: Plot of resolution ($H_{\mathrm{res}}$) vs. relevance ($H_{\mathrm{rel}}$) obtained for a trajectory of the enzyme adenylate kinase. Points are coloured according to the number of atoms retained in the corresponding mapping, $N_{\mathrm{CG}}$. The black dashed lines represent the upper and lower bounds to the theoretical maximum, derived in [marsili2013sampling, haimovici2015criticality]. The grey dotted line shows the typical behaviour of a structure-less sample, obtained by averaging over multiple random partitions of $M$ balls in an increasing number of boxes.
  • Figure 4: Graphical representation of the proteins comprising the dataset, colored according to the per-residue value of root-mean-square fluctuations as computed from the MD simulations.
  • Figure 5: (a,b) Scatter plots of the number of optimal sites identified by PROPRE for each protein against the number of residues; in both cases (striding and clustering) the coefficient of determination of the linear fit ($R^2$) is $0.98$ and the Pearson correlation coefficient is $0.99$. (a) Analysis based on $1000$ equidistant frames extracted through striding. (b) Analysis based on the centroids of $1000$ clusters obtained from a UPGMA clustering procedure. Error bars indicate the standard deviation in the number of optimal sites within the most probable bin, as determined by PROPRE using the density-based protocol. (c) Box plots of the ratio between the number of optimal sites and the average sampled volume ($N_{\mathrm{OPT}} / V$) for the two strategies: striding (STR) and clustering (CLU). For STR, orange points indicate values for the AKE protein computed from the full trajectory, as well as separately for the open and closed conformations.
  • ...and 3 more figures
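
The decimation step of Figure 1 and the filtering step of Figure 2 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the trajectory is a synthetic array of shape (frames, $N_{ha}$, 3), the frames are taken to be already superimposed (no alignment is performed), and "RSD" is read as the square root of the summed squared displacements over the retained atoms, which this excerpt does not define. The helper names `random_decimation` and `rsd_matrix` are hypothetical.

```python
import numpy as np


def random_decimation(n_heavy, n_cg, rng):
    """Pick n_cg distinct heavy-atom indices uniformly at random (sorted)."""
    return np.sort(rng.choice(n_heavy, size=n_cg, replace=False))


def rsd_matrix(traj, kept):
    """Pairwise frame-to-frame distances using only the retained atoms.

    ASSUMPTION: RSD = sqrt(sum of squared displacements over retained atoms),
    computed on frames that are already aligned to a common reference.
    """
    n_frames = traj.shape[0]
    x = traj[:, kept, :].reshape(n_frames, -1)   # shape (frames, 3 * n_cg)
    diff = x[:, None, :] - x[None, :, :]         # shape (frames, frames, 3 * n_cg)
    return np.sqrt((diff ** 2).sum(axis=-1))


rng = np.random.default_rng(0)
traj = rng.normal(size=(20, 2027, 3))    # synthetic stand-in for an MD trajectory
kept = random_decimation(2027, 100, rng)  # one mapping with N_CG = 100 of N_ha = 2027
dist = rsd_matrix(traj, kept)             # input to a clustering of the filtered frames
```

Repeating this for many random selections and for different values of $N_{CG}$, clustering each distance matrix, and scoring each clustering yields one point per mapping in the $H_{\mathrm{res}},H_{\mathrm{rel}}$ plane, as described in the Figure 2 caption.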