Data-driven, non-Markovian modelling of weather in the presence of non-stationary, non-Gaussian, and heteroskedastic climate dynamics

Thomas Sayer, Andrés Montoya-Castillo

Abstract

While the generalized Langevin equation (GLE) is a powerful tool to understand the behavior of complex dissipative systems, driving by external fields renders standard GLE construction workflows invalid. Filtering approaches that separate fluctuations from the non-equilibrium response can sometimes circumvent the need for a non-equilibrium formalism when the residual fluctuations are homoskedastic, stationary, and preferably Gaussian. Here, we introduce the temperature time series from Boulder, Colorado, as representative of the more general and complex case where the filtered time series remains non-Gaussian, non-stationary, and heteroskedastic. With this example, we develop a protocol to build an accurate and efficient low-dimensional description of the weather fluctuations. Our protocol classifies the weather data based on the position in the annual cycle, and introduces local homoskedasticity as a metric to identify seasons of likely stationarity. Within these seasons, we build pseudo-equilibrium models. Leveraging state-based generalized master equation modelling as an alternative to the GLE, we resolve difficulties like non-Gaussianity and position dependence of the memory (friction) kernel. Our data-driven approach accurately reproduces the evolving fluctuations of the Boulder temperature time series, illustrating the feasibility of our method as a general tool to describe driven, dissipative systems.

Paper Structure

This paper contains 16 sections, 25 equations, 14 figures.

Figures (14)

  • Figure 1: Daily average temperature data from Boulder, Colorado, from 1992 to 2025 \cite{NOAA_Boulder}. (a) Four-year window of the raw data, with the traditional summer and winter calendar months highlighted with red and blue markers, respectively. The spread in the winter is much larger. (b) Histogram of the raw data showing a bistable and strongly asymmetric distribution. Histograms of just the summer or winter months have different forms. (c) Filtered time series data showing the fluctuations in red and the residual periodic "baseline" in orange. Grey bars show regions excluded from our analysis due to filter-induced artifacts. See Fig. \ref{fig:correlations} for further details. (d) Filtered fluctuations become unimodal but remain asymmetric. Again, the statistics differ qualitatively when considering only summer or winter months.
  • Figure 2: Histograms resulting from progressively larger highpass filters. Dashed lines are quadratic fits to the points around the minima. Non-Gaussianity is reduced but ultimately persists until the shortest timescales.
  • Figure 3: Position-dependent friction for degenerate double-well; the mass is independent of position because the GLE describes a phase-space coordinate. (a) GLE applied directly to the position time series using a variety of definitions for the kernel. Inset: fragment of the trajectory showing the jumps between wells, with larger fluctuations in the left well. (b) Partitioning of the position coordinate into 7 states. Inset: PMF, where the dashed line is a guide to the eye showing asymmetry caused by the friction. (c) Diagonal elements of the memory kernel from the states defined in panel (b). Qualitatively different behavior in the two wells' signals with position-dependent friction in a state-based picture (inset). (d) State-based modelling allows an early memory cutoff GME. Inset: crosses show the Markov limit (i.e., a Markov state model) with lagtime equal to the GME cutoff and resolution set by its lagtime.
  • Figure 4: (a) The baseline temperature series, $T_b$ (i.e., the part removed by the filter) and its time derivative $\dot{T}_b$ mapped to normalized polar coordinates. We split this map into regularly spaced (angular) states (Top), which we then aggregate into seasons based on clustering (Bottom). (b) K-means clustering is performed on the $A$ parameter resulting from a fit of Eq. \ref{eq:asymmetric_histogram} to the filtered temperature histogram conditioned on each regular state; visually, we choose $k=4$; we then include the outlier with its nearest cluster to have sufficient statistics, resulting in 3 clusters in total. (c) Histograms conditioned on each aggregated state or 'season'. Open circles have $n<30$ observations and are not included in the fit; error bars are $2\sigma/\sqrt{n}$.
  • Figure 5: (a)--(c) Markov stochastic matrices at fixed 2 °F resolution for winter (blue), summer (brown), and equinoctial (pink) seasons. (d) $T_f(t)$ time series generated via Markov chain unraveling for each season, half-overlaid with the true data to show successful generation of colored, non-Gaussian noise. (e) Example of a fully predicted time series of baseline plus fluctuations, including changing between different seasons (colored lines). We displace historic data by 30 °F (in black) to facilitate comparison.
  • ...and 9 more figures
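
The Figure 5 caption refers to Markov stochastic matrices at fixed 2 °F resolution and to time series generated by Markov chain unraveling. The following is a minimal Python sketch of that general idea, not the authors' implementation. It assumes a lag of one time step, a row-stochastic convention (row i holds the probabilities of leaving state i), and a self-loop for any state with no observed outgoing transition. The function names `discretize`, `transition_matrix`, and `unravel` are hypothetical, and the demo uses a synthetic AR(1) series as a stand-in for the filtered temperature fluctuations.

```python
import numpy as np


def discretize(series, bin_width=2.0):
    """Map each value to an integer state index using fixed-width bins."""
    series = np.asarray(series, dtype=float)
    # Left edge of the lowest bin, aligned to a multiple of bin_width.
    lo = np.floor(series.min() / bin_width) * bin_width
    states = np.floor((series - lo) / bin_width).astype(int)
    # Bin centers, useful for mapping state indices back to values.
    centers = lo + bin_width * (np.arange(states.max() + 1) + 0.5)
    return states, centers


def transition_matrix(states, n_states, lag=1):
    """Estimate a row-stochastic matrix from transition counts at a given lag."""
    counts = np.zeros((n_states, n_states))
    # Accumulate counts of (state at t) -> (state at t + lag).
    np.add.at(counts, (states[:-lag], states[lag:]), 1.0)
    row_sums = counts.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Assumption: states with no observed exits get a self-loop.
    return np.where(row_sums > 0, counts / safe_sums, np.eye(n_states))


def unravel(P, start, n_steps, rng):
    """Generate a state trajectory by sampling the Markov chain step by step."""
    traj = np.empty(n_steps, dtype=int)
    traj[0] = start
    for t in range(1, n_steps):
        traj[t] = rng.choice(P.shape[0], p=P[traj[t - 1]])
    return traj


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in data: an AR(1) process, NOT the Boulder record.
    x = np.zeros(5000)
    for t in range(1, x.size):
        x[t] = 0.8 * x[t - 1] + 3.0 * rng.normal()
    states, centers = discretize(x, bin_width=2.0)
    P = transition_matrix(states, n_states=centers.size)
    synthetic = centers[unravel(P, states[0], 365, rng)]
    print(P.shape, synthetic[:5])
```

In a seasonal setting like the one the caption describes, one such matrix would presumably be estimated per season from only the data assigned to that season. How the sketch would extend to a memory-kernel (GME) description is not covered here.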