Determining the optimal structural resolution of proteins through an information-theoretic analysis of their conformational ensemble
Margherita Mele, Raffaele Fiorentini, Thomas Tarenzi, Giovanni Mattiotti, Raffaello Potestio
TL;DR
This work introduces PROPRE, an unsupervised information-theoretic framework to determine the minimal structural detail needed to faithfully describe a protein's conformational space. By systematically decimating heavy atoms, clustering MD frames with an RSD-based distance, and evaluating resolution and relevance via CVS, PROPRE identifies an optimal number of retained atoms $N_{\text{OPT}}$ that maximizes informative content while minimizing detail. Across 11 diverse proteins, $N_{\text{OPT}}$ scales linearly with system size, averaging about four heavy atoms per residue, a regime consistent with coarse-grained models like MARTINI and SIRAH; the optimal resolution also depends on the extent of conformational exploration. The framework provides a principled link between atomistic detail and coarse-grained representation, guiding multiscale modeling and offering insight into the structure–dynamics–function relationship in proteins, with data and code freely available for reproducibility.
Abstract
The choice of structural resolution is a fundamental aspect of protein modelling, determining the balance between descriptive power and interpretability. Although atomistic simulations provide maximal detail, much of this information is redundant for understanding the relevant large-scale motions and conformational states. Here, we introduce an unsupervised, information-theoretic framework that determines the minimal number of atoms required to retain a maximally informative description of the configurational space sampled by a protein. This framework quantifies the informativeness of coarse-grained representations obtained by systematically decimating atomic degrees of freedom and evaluating the resulting clustering of sampled conformations. Application to molecular dynamics trajectories of dynamically diverse proteins shows that the optimal number of retained atoms scales linearly with system size, averaging about four heavy atoms per residue, a value remarkably consistent with the resolution of well-established coarse-grained models such as MARTINI and SIRAH. Furthermore, the analysis shows that the optimal number of retained atoms depends not only on molecular size but also on the extent of conformational exploration, decreasing for systems dominated by collective motions. The proposed method establishes a general criterion to identify the minimal structural detail that preserves the essential configurational information, thereby offering a new viewpoint on the structure-dynamics-function relationship in proteins and guiding the construction of parsimonious yet informative multiscale models.
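As an illustration of how the informativeness of a clustering can be quantified, the following minimal sketch computes the two entropies commonly used in resolution-relevance analyses from a list of cluster labels, one per sampled conformation. The abstract does not spell out the formulas, so the definitions here are an assumption: resolution is taken as the entropy of the cluster-occupation distribution, and relevance as the entropy of the distribution of cluster sizes. The function name `resolution_relevance` is hypothetical and is not taken from the authors' code.

```python
import math
from collections import Counter


def resolution_relevance(labels):
    """Return (resolution, relevance) in nats for one cluster label per frame.

    Assumed definitions (not confirmed by the abstract):
      resolution = -sum_s (k_s / M) * log(k_s / M)
      relevance  = -sum_k (k * m_k / M) * log(k * m_k / M)
    where k_s is the number of frames in cluster s, m_k is the number of
    clusters containing exactly k frames, and M is the total number of frames.
    """
    M = len(labels)
    if M == 0:
        raise ValueError("labels must not be empty")

    # k_s: how many frames fall into each cluster.
    cluster_sizes = Counter(labels)
    # m_k: how many clusters share the same size k.
    size_multiplicity = Counter(cluster_sizes.values())

    resolution = -sum(
        (k / M) * math.log(k / M) for k in cluster_sizes.values()
    )
    relevance = -sum(
        (k * m / M) * math.log(k * m / M)
        for k, m in size_multiplicity.items()
    )
    return resolution, relevance


# Toy example: 8 frames grouped into clusters of sizes 2, 2 and 4.
toy_labels = [0, 0, 1, 1, 2, 2, 2, 2]
toy_resolution, toy_relevance = resolution_relevance(toy_labels)
```

Under these definitions both limiting cases give zero relevance: a single cluster containing every frame has zero resolution, while one cluster per frame has maximal resolution (log M), and neither carries information about how the conformations group. Intermediate clusterings can have non-zero relevance, which is what makes a comparison across different numbers of retained atoms meaningful.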
