Maximum Information Extraction Via Clustering and Minimization of Shannon Entropy

Matteo Becchi, Giovanni Maria Pavan

TL;DR

A data-driven approach is introduced that employs Shannon entropy as a transferable metric to assess and quantify Maximum Information Extraction (MInE) from data by clustering them into statistically relevant micro-domains. It provides a robust, parameter-free approach and quantitative metrics for data analysis and for the study of any type of system from its data.

Abstract

In the analysis of any type of system, ensuring maximum information extraction from its data is non-trivial. Confidence in successful information extraction typically builds on prior knowledge of the studied system or on the user's experience. However, a robust and objective criterion for ensuring maximum information extraction from data is difficult to define. Here, we introduce a data-driven approach that employs Shannon entropy as a transferable metric to assess and quantify Maximum Information Extraction (MInE) from data via their clustering into statistically relevant micro-domains. The method is general and can be applied to virtually any type of data or system. We demonstrate its efficiency by analyzing, as a first example, time-series data extracted from molecular dynamics simulations of water and ice coexisting at the solid/liquid transition temperature. The method allows quantifying the information contained in the data distributions (time-independent component) and the additional information gain attainable by analyzing data as time-series (i.e., accounting for the information contained in data time-correlations). The different micro-domains that can be effectively resolved and classified in the system are each characterized by their own entropy, and these entropies are found to be consistent with experimentally known thermodynamic parameters. A second test case demonstrates that the MInE approach is also effective for high-dimensional datasets and clearly shows that including poorly informative but noisy extra components/features in high-dimensional analyses may be not only useless but even detrimental to maximum information extraction. This provides a robust, parameter-free approach and quantitative metrics for data analysis and for the study of any type of system from its data.
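
As a rough illustration of the quantities involved, the sketch below assumes that the information gained by clustering can be estimated as the drop from the Shannon entropy of the full data distribution, $H_0$, to the population-weighted sum of the cluster entropies, $\sum_k f_k H_k$. The histogram binning, the function names, and the threshold labels standing in for a clustering result are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def shannon_entropy(values, bins, value_range):
    """Shannon entropy (in bits) of the histogram of `values`."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    p = counts[counts > 0] / counts.sum()  # drop empty bins: 0 * log(0) := 0
    return float(-np.sum(p * np.log2(p)))


def information_gain(values, labels, bins=20):
    """Return (H_0, sum_k f_k H_k, gain) for a 1D dataset and cluster labels.

    Assumption: every cluster is binned on the same grid as the full
    dataset, so that the entropies are directly comparable.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    value_range = (values.min(), values.max())

    h0 = shannon_entropy(values, bins, value_range)
    h_clust = 0.0
    for k in np.unique(labels):
        members = values[labels == k]
        f_k = members.size / values.size  # population fraction of cluster k
        h_clust += f_k * shannon_entropy(members, bins, value_range)
    return h0, h_clust, h0 - h_clust


# Hypothetical bimodal signal (two well-separated modes); a simple threshold
# stands in for the output of a real clustering algorithm.
rng = np.random.default_rng(0)
signal = np.concatenate([rng.normal(-3.0, 0.5, 5000), rng.normal(3.0, 0.5, 5000)])
toy_labels = (signal > 0.0).astype(int)
h0, h_clust, gain = information_gain(signal, toy_labels)
```

Because all clusters share one binning grid, the weighted cluster entropy cannot exceed $H_0$, so the gain defined this way is never negative. For two equally populated modes that occupy disjoint bins it approaches 1 bit.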

Paper Structure

This paper contains 22 sections, 8 equations, 6 figures.

Figures (6)

  • Figure 1: (a) Snapshot of the water/ice atomistic model system used as a case study. (b) Schematic of the SOAP descriptor, which provides a high-dimensional representation of the density, order/disorder, and symmetry of the arrangement of neighboring molecules around each molecule. (c) Denoised [donkor2024beyond] PC1 SOAP time-series [lionello2025relevant, martino2024data] for the 2048 molecules as a function of simulation time. In red: the signal of a representative molecule undergoing a water-to-ice transition. (d) Probability distribution $P(x)$ of the SOAP PC1 signals (gray). Two main clusters, corresponding to the liquid and ice phases (inset), are identified by the two maxima in $P(x)$ (orange and blue Gaussian fits). These clusters are readily detected using pattern recognition methods (see SI for details).
  • Figure 2: Information gain from Onion clustering as a function of the time resolution $\Delta t$ used in the analysis. Panel (a) shows results for the SOAP PC1 dataset with spatial averaging; panel (b) shows results for the LENS dataset. In both panels, the horizontal dashed line ($I_0$) represents the information content of the raw data distribution. The solid black curve ($I_\text{clust}$) represents the information retained after clustering, which varies with $\Delta t$. The gray-shaded area corresponds to the increase in information content achieved through clustering. The dotted line shows the information gain obtained when the dataset frames are randomly reshuffled, while the diagonally hatched area highlights the additional gain attributable to time correlations. The width of the colored regions reflects the weighted Shannon entropy $f_k H_k$ of each environment, illustrating how entropy varies with $\Delta t$. In panel (a), $I_\text{clust}$ reaches a maximum at $\Delta t \sim 2$ ns (vertical red dashed line), indicating the optimal time resolution for extracting information from the SOAP PC1 time-series. The simulation snapshot is colored according to Onion clustering at this resolution: bulk ice and liquid water appear in blue and orange, respectively, with the solid–liquid interface in green. A small fraction of unclassifiable points appear as sparse molecules in purple (color coding matches that in the plot). On the right: the top snapshot shows the corresponding clustering, using the same color scheme as in panel (a). The bottom snapshot shows the same configuration, colored by the entropy difference relative to bulk ice, converted to units of J mol$^{-1}$ K$^{-1}$.
  • Figure 3: Optimizing descriptor choice and spatiotemporal resolutions for maximum information extraction. (a) Relative information gain $\Delta I/H_0$, obtained by applying Onion clustering to time-series data derived from different descriptors: LENS (blue), the first principal component (PC1) of SOAP (orange), the first time-lagged independent component (tIC1) of SOAP (red), and the number of neighbors $n_\text{neigh}$ (green). (b) Top: Radial distribution function $g(r)$ of the water molecules (oxygen atoms only), computed over the entire simulation trajectory. Red dots mark the minima in $g(r)$, corresponding to solvation shells, used here as characteristic cutoff values $r_c$ for computing descriptor time-series. Bottom: Maximum relative information gain $\Delta I/H_0$ attainable via Onion clustering on LENS and SOAP PC1 time-series calculated using different cutoff radii $r_c$.
  • Figure 4: (a) From left to right, the trajectories of 100 particles in a two-dimensional energy landscape with 4 and 2 minima, respectively, and the probability distribution of the $y$ coordinate, $P(y)$. The clusters identified by Onion clustering in the two systems, using the full $(x, y)$ trajectories or only the $y$ coordinate, are shown in blue, orange, green and red, respectively. (b) Information gain $\Delta I$ from Onion clustering as a function of the time resolution $\Delta t$ (in simulation time-steps) used for the analysis. For each of the two example model systems (left and right panels), three different datasets are analyzed, with curve colors matching those in panel (a). For the system with four minima (left), clustering with $(x,y)$ (blue curve) resolves all four minima A-D, yielding at its maximum twice the information gain attainable when clustering with the variable $y$ only (orange curve, which resolves only two minima), but the performance degrades an order of magnitude earlier ($\Delta t\sim10$ vs. $100$ frames). For the two-minima system (right), clustering with $(x,y)$ (green) and with $y$ (red) achieves the same maximum gain (both minima A-B are resolved in both cases), though degradation is again faster for the bivariate case. In both systems, a third case (gray curve) demonstrates that adding a third coordinate $z$, which brings no additional relevant information but only noise (the minima become spherical in three dimensions), not only fails to increase the attainable information gain but also accelerates the degradation of the analysis, confining it to lower $\Delta t$.
  • Figure S1: Number of clusters discovered (red) and fraction of unclassified data points (black) as a function of the time resolution $\Delta t$ used for the Onion clustering on SOAP PC1 data.
  • ...and 1 more figures
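
The reshuffled baseline mentioned in the Figure 2 caption can be pictured with the minimal sketch below. It assumes the time-series are stored as an array of shape (particles, frames) and that reshuffling means permuting the frames of each particle independently; the caption does not specify the exact scheme, so this is one plausible reading. The helper names are hypothetical. Shuffling leaves the value distribution of each series untouched but destroys its time correlations, so any gain that survives reshuffling cannot come from the time ordering.

```python
import numpy as np


def reshuffle_frames(series, rng):
    """Independently permute the frames (axis 1) of each particle's series."""
    return rng.permuted(np.asarray(series, dtype=float), axis=1)


def mean_lag1_autocorrelation(series):
    """Lag-1 autocorrelation, averaged over particles (rows)."""
    centered = series - series.mean(axis=1, keepdims=True)
    num = np.sum(centered[:, :-1] * centered[:, 1:], axis=1)
    den = np.sum(centered * centered, axis=1)
    return float(np.mean(num / den))


# Hypothetical strongly time-correlated data: four random walks of 500 frames.
rng = np.random.default_rng(1)
walks = np.cumsum(rng.normal(size=(4, 500)), axis=1)
shuffled = reshuffle_frames(walks, rng)

corr_original = mean_lag1_autocorrelation(walks)
corr_shuffled = mean_lag1_autocorrelation(shuffled)
```

Running the same time-resolved clustering on `walks` and on `shuffled` and comparing the two gains would then separate the distribution-only contribution from the part attributable to time correlations.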