Identifying statistical indicators of temporal asymmetry using a data-driven approach

Teresa Dalle Nogare, Ben D. Fulcher

TL;DR

This study tackles the problem of identifying statistical indicators of temporal asymmetry in time series through a large-scale comparative benchmark: over 6000 features from the hctsa library evaluated across 35 diverse dynamical systems with known time-reversal properties. The authors show that while many features are inherently invariant to time reversal, a tail of high-performing statistics (notably generalized autocorrelations, symbolic motifs, and forecasting-based measures) can effectively quantify irreversibility, yet no single statistic is universally optimal. A key finding is that irreversibility is highly process-dependent, so the statistic must be tailored to the specific form of time-reversal symmetry breaking present. The work provides a unified framework linking diverse time-series approaches to irreversibility and offers practical guidance for connecting observable temporal patterns to underlying dissipative or nonlinear mechanisms in complex systems, with potential relevance to non-equilibrium thermodynamics.

Abstract

The dynamics of time-reversible systems are statistically indistinguishable when observed forward or backward in time. A rich literature of statistical methods to distinguish irreversible dynamics from the reversible dynamics of linear, Gaussian systems can provide insights into underlying mechanisms and aid modeling and statistical quantification of time-series data. But these existing time-reversibility metrics have been developed individually, forming a fragmented body of research that makes it challenging to identify the most effective approaches developed to date, and the most promising new directions for development. Here we address these issues by systematically evaluating over 6000 time-series summary statistics, derived from across the time-series analysis literature, on their ability to distinguish the time-irreversibility of data simulated from a diverse range of 35 systems. Our large-scale data-driven comparison highlights the effectiveness of several key families of statistics, including time-asymmetric forms of generalized autocorrelation functions, time-series symbolic sequences, and forecasting-related methods. All irreversible systems studied here could be accurately distinguished by a well-chosen time-series statistic, but no single statistic could accurately index the statistical form of irreversibility for all irreversible systems. This challenges the assumption that a given time-reversibility statistic will accurately capture time reversibility in general, and underscores the importance of tailoring statistical approaches to the time-reversal characteristics of a given system. Our results provide a unified understanding of the key algorithmic structures through which irreversibility can be effectively quantified from data, providing a foundation for connecting patterns in time series to the underlying mechanisms of the systems that generate them.

Paper Structure

This paper contains 18 sections, 28 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Schematic of our data-driven approach to identifying high-performing time-series statistics that can accurately index irreversibility from time-series data. (a) Time series generation: To include multiple and diverse sources of irreversibility, we simulated $5000$-sample time series from a comprehensive library of 35 discrete-time and continuous-time processes with known reversibility properties. Each panel shows selected illustrative time series segments from a given process class, with [R] indicating 'reversible' and [I] indicating 'irreversible', placed near the process label according to the reversibility specified in Table \ref{tab:processes}. We use a consistent color coding throughout the paper to identify these families of simulated processes. (b) Extraction of time-series features: For each time series $\bm x$, we constructed its time-reversed counterpart $\tilde{\bm x}$ by reversing the order of the data points, as per prior work \cite{arola-fernandez_irreversibility_2023, lacasa_time_2015, gonzalez-espinoza_arrow_2020, camassa_temporal_2024}. To find time-series properties that are sensitive to irreversibility, we systematically computed a large set of $>6000$ interpretable time-series features implemented in the hctsa library \cite{fulcher_hctsa_2017}. In this way, each time series was summarized by a set of real numbers, $f_1, \dots, f_{6082}$, that encode a broad range of its statistical properties. As an ansatz for the ability of each time-series feature $f_i$ to index a difference in statistical properties between the original time series $\bm x$ and the time-reversed time series $\tilde{\bm x}$, we quantified the absolute difference between the feature value computed on $\bm x$ and $\tilde{\bm x}$, as $|\Delta f_i| = |f_i - \tilde{f}_i|$. 
(c) Assign each feature a score: To extract time-series properties that are sensitive to the reversibility of processes, each feature $f_i$ was scored based on the discriminative power of its $|\Delta f_i|$ in classifying time series generated by reversible [R] and irreversible [I] processes using a 1-nearest-neighbor classifier (1-NN) with leave-one-out cross-validation; the classification accuracy served as the feature's score. We expect that an informative feature $f_i$, associated with a high classification accuracy, displays $\Delta f_i \approx 0$ for reversible processes and large deviations $|\Delta f_i| > 0$ for irreversible ones, yielding well-separated distributions across the dataset.
  • Figure 2: Identifying high-performing and interpretable time-series features for detecting irreversibility through large-scale empirical testing of thousands of features on 35 reversible and irreversible processes. (a) Distribution of cross-validated classification accuracy (of distinguishing reversible from irreversible processes) across $4668$ features (excluding $1414$ that are insensitive to reversibility, cf. Sec. \ref{subsec:zero_features}). Each feature $f_i$ was assessed independently on its ability to distinguish time series generated from reversible versus irreversible processes via the absolute difference between its value computed on the time series and its time-reversed counterpart, as $|\Delta f_i|$ (Eq. \ref{eq:deltaf}). There is a tail in the accuracy distribution, pointing to a subset of high-performing features; here we focus on the $127$ features with accuracies exceeding $72\%$, annotated as a dashed gray vertical line (see Supplementary Material Table S3 for a full list). (b) Distributions of $|\Delta f|$ across all 3500 time series, separated between 1500 reversible (blue) and 2000 irreversible (red) time series, are shown as box plots for three selected top-performing features: i. the fourth-order cross-moment (an example of generalized autocorrelation-based feature) $C_{1,3}(\bm x;1)=\langle x_t\, x_{t+1}^3\rangle$; ii. the symbolic motif feature, $p_{\text{uu}}(\bm x)$, which calculates the probability of two consecutive rises in a time series; and iii. the mean absolute error (MAE) of 1-step-ahead predictions made by a second-order autoregressive (AR) model. Next to each boxplot is a raincloud plot showing the $|\Delta f|$ value for all time series and colored according to process families (defined in Table \ref{tab:processes}), with random horizontal scatter to aid visualization. 
Horizontal black lines indicate the zero baseline, while gray dashed lines delimit the range of values of the statistics computed from time series generated by reversible processes. The generalized autocorrelation (i.), symbolic motif (ii.), and MAE of a 1-step-ahead AR(2) prediction (iii.) are annotated in the distribution in Fig. \ref{subfig:dist-a}. These three features were chosen as demonstrative examples of broader families of easily interpretable time-reversal-sensitive time-series features: generalized autocorrelations (Sec. \ref{subsec:generalized_ac}), symbolic motif probabilities (Sec. \ref{subsec:symbolic}), and forecasting-based measures (Sec. \ref{subsec:forecasting}), which are explained in detail in the main text. Their behavior mirrors that of all top-performing features, exhibiting $|\Delta f| \approx 0$ for reversible processes but deviating substantially from zero for many irreversible processes.
  • Figure 3: Time-series features with time-symmetric constructions are invariant to time reversal, while time-asymmetric constructions can be powerful indices of time-reversibility. We illustrate this concept with respect to two families of time-series statistics: (a), (b): generalized autocorrelation functions; and (c), (d): the frequency of patterns of consecutive rises ('up': u) and falls ('down': d) in a symbolized transformation of the time series. (a) We depict three examples of generalized autocorrelation functions with time-symmetric constructions, visualized diagrammatically using a comb-like representation introduced here. In this representation, each vertical segment represents a time-series value at a specific time $t$, the spacing between groups of segments reflects the temporal lag between consecutive feature terms, and the number of segments in each group represents the exponent, that is, the number of times a given time-series point contributes to the statistic. All of these features are symmetric in time with respect to their midpoint, indicated with a vertical dashed red line. (b) Three generalized autocorrelation functions with time-asymmetric constructions are depicted, with the asymmetry arising from the use of different exponents (or non-equally spaced temporal lags). (c) Three example time-symmetric sequences of successive rises ('up': u) and falls ('down': d) are depicted diagrammatically. The symmetry of these diagrams about their midpoint, depicted as a red dashed line, corresponds to the second half being the mirror image of the logical negation of the first half (indicated by the NOT operation, which transforms 'u' to 'd' and vice versa). (d) Three examples of time-reversal asymmetric symbolic patterns (which are therefore candidates for indexing irreversibility) are depicted.
  • Figure 4: All statistical time-series features have strengths and weaknesses at detecting irreversibility across different processes, demonstrating the need to tailor statistical summaries to the specific sources of irreversibility in a given process. (a) Box plot with scatter points showing the left-out accuracy of the 127 top-performing features for five representative irreversible processes: autoregressive with uniform noise distribution (AR1_UNO), logistic map (with $r = 4$) (LOGISTIC_4), linear model with logistic map noise (LLOG), noise-driven sine map (SINE_MAP), and a linear projection of the multidimensional Lorenz system (LORENZ_SUM). Simulation details of the implemented models are in Appendix \ref{app:models}. The generalized autocorrelation feature $\langle x_t\, x_{t+1}^3 \rangle$ is highlighted using a yellow star and the symbolic feature $p_{\text{uu}}(\bm x)$ using an orange triangle. For each process, we computed the minimum and maximum left-out accuracies obtained by the set of top-performing features and compared these values across all irreversible processes. We then highlighted the feature with the lowest maximum accuracy across irreversible processes using a dashed line (corresponding to 99%), and similarly marked the feature with the highest minimum left-out accuracy (29%). We plot the distribution of absolute feature differences $|\Delta f|$ across: (i) all reversible time series; (ii) the autoregressive process with uniform noise (AR1_UNO); and (iii) the logistic map (with $r = 4$) (LOGISTIC_4), for two example features: (b) generalized autocorrelation $\langle x_t\, x_{t+1}^3 \rangle$; and (c) the symbolic sequence probability $p_{\text{uu}}$. In both plots, the lower solid line denotes the $|\Delta f| = 0$ baseline, while the upper dashed line delineates the maximum of the range of values observed across all time series generated from reversible processes. 
(d) Distribution of left-out accuracy across 2000 time series from the 20 irreversible processes for five representative features: two forms of generalized autocorrelation, namely the fourth-order statistic $\langle x_t\, x_{t+1}^3 \rangle$ and the 'generalized linear self-correlation function' $C^\prime_{1,2}(\bm x;1)$ \cite{duarte_queiros_yet_2007}, reported in Eq. \ref{eqn:glscf}, using $\alpha = 1$, $\beta = 2$, and $\tau = 1$; the probability of two successive increases (the 'uu' pattern) in a time series, $p_{\text{uu}}(\bm x)$; the mean absolute error of 1-step-ahead predictions made by a second-order AR model (labeled as 'MAE AR(2) predictor'); and the standard deviation of the residuals from a nonlinear prediction model (labeled as 'std dev residuals nonlinear prediction').
  • Figure B.1: Time-series features $\langle x_t^3\, x_{t+1}\rangle$, $p_{\text{uu}}(\bm x)$ and mean absolute error (MAE) of an AR(2) predictor as a function of the time-series length for a representative process from each family. We show the mean and standard deviation of the feature difference of the generalized autocorrelation $\langle x_t^3\, x_{t+1}\rangle$ (light colors, solid line), the probability of two successive rises $p_{\text{uu}}(\bm x)$ (medium colors, dashed line), and the MAE of an AR(2) predictor (dark colors, dotted line) for time series of lengths $T$ = [10, 20, 50, 100, 200, 500, 1000, 2000, 5000] ($x$-axis, logarithmic scale), computed over 100 realizations of each process. Each panel reports a representative process per family, with colors indicating process families (see Table \ref{tab:processes}).
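
The scoring scheme described in the Figure 1 caption (compute a feature on $\bm x$ and on its time-reversed copy, take $|\Delta f|$, then score the feature by leave-one-out 1-NN accuracy) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' hctsa pipeline. The function names, the z-scoring step, the AR(1) coefficient of 0.5, and the use of 10 realizations per process are all assumptions made for this example. The toy dataset uses a single reversible process (a linear Gaussian AR(1)) and a single irreversible one (the logistic map with $r = 4$).

```python
import numpy as np

def gen_autocorr_1_3(x, tau=1):
    """Cross-moment <x_t * x_{t+tau}^3> on a z-scored copy (z-scoring is an assumption)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z[:-tau] * z[tau:] ** 3))

def p_uu(x):
    """Fraction of positions where two consecutive rises ('uu') occur."""
    up = np.diff(x) > 0
    return float(np.mean(up[:-1] & up[1:]))

def abs_delta(feature, x):
    """|f(x) - f(reversed x)|, the per-series irreversibility score."""
    return abs(feature(x) - feature(x[::-1]))

def loo_1nn_accuracy(values, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on scalar values."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    dist = np.abs(values[:, None] - values[None, :])
    np.fill_diagonal(dist, np.inf)  # a point may not be its own neighbour
    nearest = dist.argmin(axis=1)
    return float(np.mean(labels[nearest] == labels))

def simulate_ar1(n, phi, rng):
    """Linear Gaussian AR(1) process (time-reversible)."""
    noise = rng.standard_normal(n)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + noise[t]
    return x

def simulate_logistic(n, rng, r=4.0, burn_in=100):
    """Logistic map x -> r x (1 - x) (time-irreversible), after a short burn-in."""
    x = rng.uniform(0.1, 0.9)
    out = np.empty(n)
    for t in range(n + burn_in):
        x = r * x * (1.0 - x)
        if t >= burn_in:
            out[t - burn_in] = x
    return out

rng = np.random.default_rng(0)
n_samples, n_series = 5000, 10  # 10 realizations per process is an assumption
series = [simulate_ar1(n_samples, 0.5, rng) for _ in range(n_series)]
series += [simulate_logistic(n_samples, rng) for _ in range(n_series)]
labels = ["R"] * n_series + ["I"] * n_series

deltas = [abs_delta(p_uu, x) for x in series]
acc = loo_1nn_accuracy(deltas, labels)
print(f"LOO 1-NN accuracy using |delta p_uu|: {acc:.2f}")
```

Passing `gen_autocorr_1_3` to `abs_delta` instead of `p_uu` scores the cross-moment feature in the same way. Consistent with the paper's main message, there is no reason to expect different features to separate a given pair of processes equally well.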