A data driven approach to classify descriptors based on their efficiency in translating noisy trajectories into physically-relevant information

Simone Martino, Domiziano Doria, Chiara Lionello, Matteo Becchi, Giovanni M. Pavan

TL;DR

The paper tackles the challenge of extracting physically meaningful information from noisy molecular trajectories by proposing a purely data-driven, agnostic framework to compare descriptors. It leverages Onion Clustering across multiple time resolutions to quantify descriptor efficiency via the number of resolvable environments and information loss, incorporating spatial denoising to assess noise effects. The results show that advanced descriptors like SOAP and LENS excel in raw data, but simple descriptors such as $d_5$, $N_{neigh}$, and $v$ can match or surpass them after denoising, with $d_5$ even distinguishing subregions of the interface. An evaluation space built from a max-resolved-information criterion offers a general, parameter-free method to compare descriptors and identify an optimal analysis framework for complex, noisy trajectories across systems and scales.
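The spatial denoising step mentioned above can be pictured with a minimal sketch. It assumes that denoising means replacing each molecule's descriptor value with the mean over the molecule itself and its neighbors within a cutoff, for a single frame and without periodic boundaries. The function name `spatial_average`, the inclusion of the central molecule, and the cutoff value are illustrative assumptions, not details taken from the paper.

```python
import math

def spatial_average(positions, values, cutoff):
    """Average a per-molecule descriptor over each local neighborhood (one frame).

    Assumption: the neighborhood includes the central molecule itself
    (its distance is 0, so it always passes the cutoff test) and no
    periodic boundary conditions are applied.
    """
    averaged = []
    for p_i in positions:
        # Collect descriptor values of all molecules within the cutoff of p_i.
        neighborhood = [
            values[j]
            for j, p_j in enumerate(positions)
            if math.dist(p_i, p_j) <= cutoff
        ]
        averaged.append(sum(neighborhood) / len(neighborhood))
    return averaged

# Hypothetical toy frame: two close molecules and one isolated molecule.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
values = [1.0, 3.0, 10.0]
smoothed = spatial_average(positions, values, cutoff=1.5)
```

In this toy frame the two nearby molecules end up sharing the same averaged value, while the isolated one keeps its own. That is the intended effect: local fluctuations are smoothed, at the cost of some spatial resolution.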

Abstract

Reconstructing the physical complexity of many-body dynamical systems can be challenging. Starting from the trajectories of their constitutive units (raw data), typical approaches require selecting appropriate descriptors to convert them into time-series, which are then analyzed to extract interpretable information. However, identifying the most effective descriptor is often non-trivial. Here, we report a data-driven approach to compare the efficiency of various descriptors in extracting information from noisy trajectories and translating it into physically relevant insights. As a prototypical system with non-trivial internal complexity, we analyze molecular dynamics trajectories of an atomistic system where ice and water coexist in equilibrium near the solid/liquid transition temperature. We compare general and specific descriptors often used in aqueous systems: number of neighbors, molecular velocities, Smooth Overlap of Atomic Positions (SOAP), Local Environments and Neighbors Shuffling (LENS), Orientational Tetrahedral Order, and distance from the fifth neighbor ($d_5$). Using Onion Clustering -- an efficient unsupervised method for single-point time-series analysis -- we assess the maximum extractable information for each descriptor and rank them via a high-dimensional metric. Our results show that advanced descriptors like SOAP and LENS outperform classical ones due to higher signal-to-noise ratios. Nonetheless, even simple descriptors can rival or exceed advanced ones after local signal denoising. For example, $d_5$, initially among the weakest, becomes the most effective at resolving the system's non-local dynamical complexity after denoising. This work highlights the critical role of noise in information extraction from molecular trajectories and offers a data-driven approach to identify optimal descriptors for systems with characteristic internal complexity.
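To make the two simplest descriptors named in the abstract concrete, here is a rough sketch of how the number of neighbors and the distance from the fifth neighbor ($d_5$) might be computed for one molecule in one frame. The helper names (`sorted_distances`, `n_neigh`, `d5`), the absence of periodic boundary conditions, and the cutoff value are assumptions made for illustration; the paper's own implementation may differ.

```python
import math

def sorted_distances(positions, i):
    """Ascending distances from molecule i to every other molecule.

    Assumption: plain Euclidean distances, no periodic boundaries.
    """
    return sorted(
        math.dist(positions[i], p) for j, p in enumerate(positions) if j != i
    )

def n_neigh(positions, i, cutoff):
    """Number of molecules within `cutoff` of molecule i (excluding i)."""
    return sum(1 for d in sorted_distances(positions, i) if d <= cutoff)

def d5(positions, i):
    """Distance from molecule i to its fifth-nearest neighbor."""
    return sorted_distances(positions, i)[4]

# Hypothetical toy configuration: seven molecules equally spaced on a line.
positions = [(float(x), 0.0, 0.0) for x in range(7)]
central_count = n_neigh(positions, 3, cutoff=2.5)
central_d5 = d5(positions, 3)
```

Evaluating these functions for every molecule at every frame would give the per-molecule time-series that a clustering method then analyzes.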

Paper Structure

This paper contains 8 sections, 6 equations, 7 figures.

Figures (7)

  • Figure 1: Ice/liquid water coexistence MD simulation, descriptors and LENS clustering. A: Flowchart of the information extraction procedure through the use of a generic descriptor $D_i$. B: Snapshot of the simulation. Oxygen atoms are colored red, and hydrogen atoms are colored white. C: Schematic representation of a static descriptor, $D_i(t, R)$, which depends on the coordinates and/or identities of molecules within a cutoff radius $R$ at time $t$. D: Schematic representation of a dynamic descriptor (LENS), which depends on the variation in the molecules' identities within a cutoff radius $R$ between times $t$ and $t+\delta t$. E: LENS signal time-series for each particle, as a function of simulation time. The background is colored according to the thresholds between the three identified clusters: solid ice (gray), liquid water (blue) and interface (red). The KDE of the signals (gray shaded area) is overlaid with the Gaussian distributions fitted by the clustering algorithm. Dashed lines represent the thresholds between the clusters. F: Typical Onion Clustering output plot, showing the number of environments detected (blue line) and the fraction of unclassified data points (orange line), as a function of the time resolution $\Delta t$. The red dashed line indicates the time resolution ($\Delta t = 0.3$ ns) used for the clustering shown in this figure. G: Simulation snapshot where molecules are colored according to their cluster assignment (for $\Delta t = 0.3$ ns). Unclassified molecules are colored purple.
  • Figure 2: Comparison between descriptors. For the different descriptors -- A: number of neighbors $N_\text{neigh}$, B: distance from the fifth atom $d_5$, C: molecule velocity modulus $v$ and D: orientational tetrahedral order parameter $q_\text{tet}$ -- we show, from left to right: a schematic representation, the signal distribution over the entire simulation, clustered KDE, the number of resolved environments (blue line) and fraction of unclassified data (orange line), and a clustering-colored snapshot at the time resolution highlighted by the red line.
  • Figure 3: Clustering results on SOAP. A: Projection of SOAP on the first two PCs. B: Projection of spatially averaged SOAP on the first two PCs. C: Explained variance of the first three PCs for raw SOAP (blue bars) and spatially averaged SOAP (orange bars). D: Onion Clustering output on SOAP; from left to right, signal distribution, clustered KDE, number of resolved environments (blue line), fraction of unclassified data (orange line) and clustering-colored snapshot at the time resolution highlighted by the red line. E: Onion Clustering output on spatially averaged SOAP; same as in panel D.
  • Figure 4: Comparison between descriptors after local noise reduction. For each spatially averaged descriptor -- A: number of neighbors $N_\text{neigh}$, B: distance from the fifth atom $d_5$, C: molecule velocity modulus $v$ and D: orientational tetrahedral order parameter $q_\text{tet}$ -- we show, from left to right: a schematic representation, the signal distribution over the entire simulation, clustered KDE, the number of resolved environments (blue line) and fraction of unclassified data (orange line), and a clustering-colored snapshot at the time resolution highlighted by the red line.
  • Figure 5: Evaluation space. (A) $\chi$ parameter for each raw descriptor, as a function of the time resolution $\Delta t$. (B) $\chi$ parameter for each denoised descriptor, as a function of the time resolution $\Delta t$. (C) Two-dimensional projection of the "evaluation space", using the first and third PCs. (D) Hierarchical clustering results and distance matrix between all the descriptors studied. (E) Amplitude of the denoising effect, obtained by computing the distance between the raw version of each descriptor and its denoised counterpart.
  • ...and 2 more figures
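
The dynamic descriptor described in the Figure 1D caption depends on how the identities of a molecule's neighbors change between $t$ and $t+\delta t$. A small sketch of that idea follows, assuming the normalization commonly quoted for LENS: the number of neighbors that entered or left the neighborhood, divided by the sum of the two neighbor counts. The function name `lens_value`, the zero value returned for an empty neighborhood, and the exact normalization are assumptions here, since this summary does not state the formula.

```python
def lens_value(neighbors_t, neighbors_t_dt):
    """Fraction of neighbor identities that changed between two frames.

    Assumed form: |symmetric difference| / (|C_t| + |C_{t+dt}|),
    where C_t is the set of neighbor IDs within the cutoff at time t.
    """
    c_t, c_t_dt = set(neighbors_t), set(neighbors_t_dt)
    total = len(c_t) + len(c_t_dt)
    if total == 0:
        # Assumption: a molecule with no neighbors in either frame scores 0.
        return 0.0
    # a ^ b is the symmetric difference: IDs present in exactly one frame.
    return len(c_t ^ c_t_dt) / total

# Hypothetical neighbor ID lists for one molecule at two consecutive frames.
unchanged = lens_value([1, 2, 3, 4], [1, 2, 3, 4])
one_swap = lens_value([1, 2, 3, 4], [1, 2, 3, 5])
full_turnover = lens_value([1, 2], [3, 4])
```

Under this assumed form the value ranges from 0 (same neighbors in both frames, as expected in solid ice) to 1 (complete turnover of neighbors), which matches the qualitative picture of ice, interface and liquid separating along the LENS signal.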