Data-driven assessment of optimal spatiotemporal resolutions for information extraction in noisy time series data

Domiziano Doria, Simone Martino, Matteo Becchi, Giovanni M. Pavan

TL;DR

An unsupervised approach that learns, starting solely from the data of a system, the characteristic length scales of its dominant key events/processes and the optimal spatiotemporal resolutions to characterize them. The optimal resolution proves to be related to the characteristic spatiotemporal length scales of the local/collective physical events dominating the system.

Abstract

In general, comprehension of any type of complex system depends on the resolution used to examine the phenomena occurring within it. However, identifying a priori, for example, the best time frequencies/scales at which to study a certain system over time, or the spatial distances at which to look for correlations, symmetries, and fluctuations, is most often non-trivial. Here we describe an unsupervised approach that, starting solely from the data of a system, allows learning the characteristic length scales of the dominant key events/processes and the optimal spatiotemporal resolutions to characterize them. We tested this approach on time series data obtained from simulation or experimental trajectories of various example many-body complex systems ranging from the atomic to the macroscopic scale and having diverse internal dynamic complexities. Our method automatically analyzes the system data by evaluating correlations at all relevant inter-particle distances and at all possible inter-frame intervals into which the time series can be subdivided, namely, at all space and time resolutions. The optimal spatiotemporal resolution for studying a certain system is thus the one that maximizes information extraction and classification from the system's data, and we show that it is related to the characteristic spatiotemporal length scales of the local/collective physical events dominating the system. This approach is broadly applicable and can be used to optimize the study of different types of data (static distributions, time series, or signals). The concept of 'optimal resolution' has a general character and provides a robust basis for characterizing any type of system based on its data, as well as for guiding data analysis in general.

Paper Structure

This paper contains 21 sections, 1 equation, 10 figures.

Figures (10)

  • Figure 1: Extracting information from, e.g., LENS time series of ice/water dynamic coexistence simulation trajectories. A Scheme of example local dynamical events captured by the LENS descriptor (permutation, addition, or subtraction of neighbors). B LENS signals for all the water molecules (their oxygen atoms) in the system as a function of simulation time, and their cumulative distribution (KDE: on the right, in blue). C Same LENS time series data with the background colored based on the three main micro-clusters detected by Onion Clustering (using a time-resolution of $\Delta t=1.1$ ns). The Gaussian LENS environments are shown in the KDE (right) as solid Gaussian curves; the inter-cluster thresholds are indicated as dotted horizontal lines. D Output Onion plot. The curve in blue (primary $y$-axis) shows the number of clusters classifiable by Onion Clustering as a function of the time resolution $\Delta t$. The curve in orange (secondary $y$-axis) shows the fraction of unclassifiable data points (stored into a cluster named ENV0) as a function of $\Delta t$ (i.e., the fraction of dynamical events occurring faster than the resolution of the analysis). The vertical red dashed line indicates the time-resolution $\Delta t = 1.1$ ns used for the other panels (clusters and thresholds in panel C). E MD snapshot of the ice/water coexistence simulation showing the TIP4P/ICE molecules (left) and the same molecules colored based on the clusters classified at the example time-resolution, which correspond to the bulk of ice (in gray), to the bulk of the liquid phase (in blue), and to the ice/water interface (in red). In white are the unclassifiable points: these are domains in the liquid which freeze and re-melt faster than $\Delta t=1.1$ ns (Refs. becchi_layer-by-layer_2024, Capelli2022Ephemeral, crippa_detecting_2023, caruso_timesoap_2023). F Schematic representation of the solvation shells accounting for the first, second, third, etc., neighbor particles around a unit $i$ in a generic system: using different cutoff radii means capturing different types of events (e.g., local vs. non-local) whose relevance depends on the physics of the system and is often not clear a priori.
  • Figure 2: Effect of spatial and temporal resolutions on information extraction from LENS time series data. A Radial distribution function $g(r)$ of the oxygen atoms of all water molecules in the system: relevant $g(r)$ minima are detected (red circles) and used as critical cutoff radii $r_c$ to calculate LENS signals retaining information on local vs. non-local events/phenomena. B-G Onion plots obtained from the analyses of the LENS time series with different $r_c$. The number of resolved micro-clusters is shown in blue, while the fraction of unclassifiable data (in the ENV0 cluster) is shown in orange. The red vertical dashed line shows the example time resolution of $\Delta t=1.1$ ns used for the snapshots in panel I. H In blue: mean number of classifiable micro-clusters (ENVs) before the fraction of unclassifiable data reaches 50%, as a function of the $r_c$ used in the analysis. The data show a clear trend (blue curve), where the maximum efficiency in information extraction-and-discretization is obtained at $r_c \sim 10$ Å: namely, when accounting for events involving up to the 3rd-4th neighbor shell. I Representative MD snapshot of the system where the water molecules are colored according to the clustering obtained with the different $r_c$ in the analyses of panels B-G at the example time-resolution of $\Delta t=1.1$ ns: the red ice/water interface can be resolved in the spatial resolution range of $8.5 \leq r_c \leq 15$ Å, and for temporal resolutions finer than 10 ns ($\Delta t < 10$ ns).
  • Figure 3: Effect of spatial and temporal resolutions on information extraction from, e.g., SOAP time series data. A Schematic representation of SOAP. For each molecule $i$ in the system, at each sampled timestep $t$, the SOAP vector contains information on the distances and spatial displacements of the neighbors within a sphere of cutoff $r_c$: the SOAP power spectrum is thus a fingerprint of the local neighbor density around every SOAP center (the oxygen atoms of each water molecule, in this case). B-G Onion plots reporting the results of the clustering analyses of the SOAP PC1 time series as a function of the cutoff $r_c$ values used to calculate the SOAP power spectra. H In green: mean number of classifiable micro-clusters (ENVs) before the fraction of unclassifiable data reaches 50%, as a function of the $r_c$ used in the analysis. The data show a clear trend (green curve), where the maximum efficiency in information extraction-and-discretization is obtained at $13 \leq r_c \leq 15$ Å, where the interface is detected in a robust way for time-resolutions of $\Delta t<20$ ns.
  • Figure 4: Optimal resolution in the study of a dynamic metal surface. A MD snapshots of a Cu surface composed of 2400 atoms simulated at $T=600$ K: the atoms are colored according to their coordination number. Top: crystalline surface at $T = 0$ K before the start of the simulation; bottom: representative snapshot of the surface equilibrated at $T = 600$ K. B Onion plot: number of clusters (in blue) classified from the LENS time series obtained with cutoff radius $r_c = 5$ Å, and fraction of unclassifiable data points (orange) as a function of the time-resolution $\Delta t$ used in the analysis. The vertical red dashed line indicates $\Delta t = 0.12$ ns as an example time-resolution used to plot the analysis results in panels C-D. C LENS time series for all Cu atoms in the surface model along the entire MD simulation of $\tau=150$ ns (sampled every $\Delta\tau=10$ ps; see Methods for complete details). At the resolution of $\Delta t = 0.12$ ns, Onion Clustering classifies six different micro-clusters (colored areas) whose thresholds are identified by horizontal colored dashed lines. D Top view of the simulation box at three different simulation times: $t=40$, $61$, and $150$ ns. An example Cu atom (ID = 144) is highlighted in black, and its trajectory up to that point is colored according to the LENS micro-cluster it belonged to. At 150 ns, a red/orange straight vertical sliding motion along the (211) surface edge is clearly visible. E Radial distribution function $g(r)$ of the Cu atoms in the system: relevant spatial cutoffs are highlighted by red circles (relevant $g(r)$ minima). F-M Onion plots reporting the results of the analyses on LENS time series with different $r_c$. Same color coding as before. N Mean number of clusters (ENVs) classifiable (before ENV0 reaches 50%) as a function of the $r_c$ used in the analysis. In this case, dominated by local single-atom dynamical events, the optimal resolution is achieved at the smallest $r_c$: i.e., when accounting only for the first neighbor shell.
  • Figure 5: Optimal resolution and characteristic length-scales in experimental complex systems: e.g., collective waves in Quincke roller colloids. A Cartoon representation of Quincke rollers: dielectric colloidal microparticles confined in the $xy$ plane that, immersed in a conducting fluid, give rise to complex collective motions, waves, vortices, etc., under exposure to a weak DC electric field (orthogonal to the $xy$ plane). We analyze trajectories resolved from an experimental microscopy movie taken from Ref. liu_activity_2021, where 6921 Quincke rollers undergo a collective wave that crosses a microscopy field of $700\times700$ $\mu$m$^2$ left-to-right over $\tau = 0.25$ s of observation. We measured the average local velocity alignment $\phi$ using a sphere of radius $r_c$ (Eq. 1). B Radial distribution function $g(r)$ of the particles in the system, and characteristic $r_c$ distances identified by red circles. C Onion plot for the analysis of the $\phi$ time series computed with cutoff radius $r_c = 58.8$ $\mu$m: number of clusters (in blue) and fraction of unclassifiable data (orange) as a function of the time-resolution $\Delta t$ (the red dashed line identifies the $\Delta t = 2$ ms time-resolution, for which results are shown as an example in panels D-E). D The five clusters resolved by Onion Clustering at the example time-resolution, colored in gray, blue, yellow, green, and red (ordered with increasing $\phi$). The core of the wave crossing the microscopy field is clearly visible in red (see also Supplementary Movie S1). E Four snapshots taken from the trajectory, where the Quincke rollers are colored according to the cluster they belong to. F-H Onion results of the analysis of the $\phi$ time series calculated with different $r_c$ values. In this case, the maximum efficiency in information extraction-and-classification is encountered at $r_c=91$ $\mu$m, corresponding to the $\sim$5th neighbor shell. I Number of clusters resolved in each system, rescaled so that the maximum of each curve is equal to 1, as a function of the cutoff radius $r_c$ used for the computation of the descriptor. In order to compare the different systems, $r_c$ is expressed in multiples of the first neighbor shell radius $a$.
  • ...and 5 more figures
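
The captions above describe two recurring steps: taking the minima of $g(r)$ as candidate cutoff radii $r_c$, and scoring each $r_c$ by the mean number of classifiable clusters before the ENV0 fraction reaches 50%. A minimal Python sketch of those two steps could look as follows. It is not the authors' implementation: the function names (`gr_minima`, `mean_clusters_before_half`, `best_cutoff`) are hypothetical, the "before 50%" rule is read here as "average over all $\Delta t$ values preceding the first one where ENV0 is at or above the threshold", and the Onion curves are assumed to be already available as arrays ordered by increasing $\Delta t$. The toy numbers are invented for illustration only.

```python
import numpy as np


def gr_minima(r, g):
    """Return the r values at strict interior local minima of g(r).

    Assumption: a minimum is any sample lower than both of its neighbors;
    noisy g(r) data would likely need smoothing first.
    """
    r = np.asarray(r, dtype=float)
    g = np.asarray(g, dtype=float)
    is_min = (g[1:-1] < g[:-2]) & (g[1:-1] < g[2:])
    return r[1:-1][is_min]


def mean_clusters_before_half(n_clusters, env0_fraction, threshold=0.5):
    """Mean cluster count over the time resolutions that precede the first
    one where the unclassified (ENV0) fraction reaches the threshold.

    Both inputs are assumed to be ordered by increasing time resolution.
    Returns 0.0 if ENV0 is already at or above the threshold at the start.
    """
    n_clusters = np.asarray(n_clusters, dtype=float)
    env0_fraction = np.asarray(env0_fraction, dtype=float)
    above = np.nonzero(env0_fraction >= threshold)[0]
    stop = above[0] if above.size else len(n_clusters)
    if stop == 0:
        return 0.0
    return float(n_clusters[:stop].mean())


def best_cutoff(onion_curves):
    """Pick the cutoff with the highest score.

    onion_curves: dict mapping r_c -> (n_clusters, env0_fraction).
    Returns (best r_c, dict of scores per r_c).
    """
    scores = {
        rc: mean_clusters_before_half(n, f)
        for rc, (n, f) in onion_curves.items()
    }
    return max(scores, key=scores.get), scores


# Toy example with invented values (arbitrary units, not data from the paper).
toy_r = [0, 1, 2, 3, 4, 5]
toy_g = [0.0, 2.0, 1.0, 3.0, 0.5, 1.5]
candidate_cutoffs = gr_minima(toy_r, toy_g)

toy_curves = {
    1.0: ([2, 2, 1, 1], [0.05, 0.2, 0.6, 0.9]),
    2.0: ([3, 3, 2, 1], [0.1, 0.2, 0.6, 0.9]),
}
best_rc, scores = best_cutoff(toy_curves)
print(candidate_cutoffs, best_rc, scores)
```

In practice the per-$r_c$ curves would come from running the clustering at each time resolution; here they are simply passed in, so the sketch stays independent of any particular clustering library.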