Defining error accumulation in ML atmospheric simulators

Raghul Parthipan, Mohit Anand, Hannah M. Christensen, J. Scott Hosking, Damon J. Wischik

TL;DR

This work defines error accumulation for autoregressive ML atmospheric simulators and introduces a KL-divergence-based metric that compares a generative model to a CTS reference to isolate fixable model deficiencies from intrinsic chaos and unobserved-variable effects. It further proposes a regularization strategy that adds a KL penalty to the likelihood objective, guided by the error-accumulation metric, and validates the approach on Lorenz-63, Lorenz-96, and ERA5-based weather data. Results show improvements in RMSE and spread/skill, with the error-accumulation signal providing diagnostic insight into where models may be improved and how CTS quality influences the signal. The findings highlight practical impacts for ensemble forecasting and emphasize CTS improvements as a key lever for advancing ML-based weather prediction while noting computational and methodological limitations.

Abstract

Machine learning (ML) has recently shown significant promise in modelling atmospheric systems, such as the weather. Many of these ML models are autoregressive, and error accumulation in their forecasts is a key problem. However, there is no clear definition of what 'error accumulation' actually entails. In this paper, we propose a definition and an associated metric to measure it. Our definition distinguishes between errors which are due to model deficiencies, which we may hope to fix, and those due to the intrinsic properties of atmospheric systems (chaos, unobserved variables), which are not fixable. We illustrate the usefulness of this definition by proposing a simple regularization loss penalty inspired by it. This approach shows performance improvements (according to RMSE and spread/skill) in a selection of atmospheric systems, including a real-world weather prediction task.
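The two evaluation scores named in the abstract can be sketched as follows. This is a minimal illustration using the standard definitions of ensemble-mean RMSE and the spread/skill ratio, not the paper's evaluation code; the function name `ensemble_rmse_and_spread_skill` and the array layout `(members, cases, lead_times)` are assumptions made for this example.

```python
import numpy as np


def ensemble_rmse_and_spread_skill(ensemble, truth):
    """Per-lead-time ensemble-mean RMSE and spread/skill ratio.

    ensemble: array of shape (members, cases, lead_times)  [assumed layout]
    truth:    array of shape (cases, lead_times)
    """
    ens_mean = ensemble.mean(axis=0)                      # (cases, lead_times)
    # Skill: RMSE of the ensemble mean, averaged over forecast cases.
    rmse = np.sqrt(((ens_mean - truth) ** 2).mean(axis=0))
    # Spread: square root of the case-averaged ensemble variance.
    spread = np.sqrt(ensemble.var(axis=0, ddof=1).mean(axis=0))
    return rmse, spread / rmse


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centre = rng.normal(size=(2000, 5))
    truth = centre + rng.normal(size=(2000, 5))
    # Members drawn from the same distribution as the truth (well calibrated).
    members = centre + rng.normal(size=(50, 2000, 5))
    rmse, ratio = ensemble_rmse_and_spread_skill(members, truth)
    print("RMSE per lead time:", rmse)
    print("spread/skill per lead time:", ratio)
```

A ratio close to 1 indicates that the ensemble spread matches its error; a ratio well below 1 indicates under-dispersion.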

Paper Structure (70 sections, 20 equations, 18 figures, 1 table)

Figures (18)

  • Figure 1: Illustration of a form of explosive error accumulation in the Lorenz 96. (a) Simulations from the random walk generative model (equation \ref{eq:random_walk}) and the CTS (equation \ref{eq:cts_l96}) for a specific initial condition, with the ground truth (blue). (b) RMSE, (c) spread/skill, and (d) our error accumulation metric. The spread/skill is erratic as the RMSE is often near zero, leading to large spread/skill values. Despite this, the spread/skill of the generative model increases over time (visible in the increasing spread of trajectories in (a)), unlike the CTS. The generative model's poor behaviour (explosive trajectories) is most evident in (d).
  • Figure 2: Illustration of a non-explosive form of error accumulation in the Lorenz 63. (a) Simulations from an iterative generative model and a CTS for a specific initial condition, with the ground truth (blue). The generative model trajectories fail to cover the truth, especially for lead times from 20 to 30. This issue is not apparent from the relatively small RMSE in (b) and can only be inferred from the under-dispersion in the spread/skill in (c). (d) Our error accumulation metric flags the generative model's failure to cover the truth, which the CTS does cover, suggesting the generative model's errors are due to model deficiencies as opposed to other factors.
  • Figure 3: Example of predictability limits being reached due to STIC in the Lorenz 63. (a) Simulations from Figure 2 extended to further lead times. That the predictability horizon has been reached is not immediately obvious from (b) the RMSE or (c) the spread/skill. (d) Our error accumulation metric remains small, indicating that the generative model performs similarly to the CTS, suggesting the remaining errors are due to factors (STIC) separate from model deficiency.
  • Figure 4: Illustration of the error accumulation metric in equation \ref{eq:err_acc_kl_approx}. (a) 40 random simulations from a generative model and a CTS for the Lorenz 63 system, given an initial condition. (b) Error accumulation metric. It is high at time 26 as the generative model places much density outside that of the CTS, as shown in (c). It remains high at time 93 due to the generative model's inadequate coverage compared to the CTS, which centres more density around the truth (blue). The metric is low at time 250 since the generative model and the CTS distributions align well. The metric indicates the need for better generative model performance up to lead times of 100, where the CTS is more capable.
  • Figure 5: Evaluation of Lorenz 63 ensembles from a CTS, a generative model, a generative model with rollout training, and a generative model with our regularization strategy. (a) Ensemble-mean RMSE skill (lower is better). (b) Spread/skill ratio (closer to 1 is better; lower suggests under-dispersion). (c) Error accumulation metric (lower is better). Our approach improves the spread/skill ratio, and achieves skill closer to the CTS at longer lead times. 95% confidence bands are shown.
  • ...and 13 more figures
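
The captions above describe a KL-divergence-based signal that is large when the generative model's ensemble disagrees with the CTS ensemble and small when the two align. The sketch below illustrates that idea only. It fits a one-dimensional Gaussian to each ensemble at every lead time and evaluates the closed-form KL divergence between the fits. The Gaussian fit, the direction KL(generative || CTS), and the name `gaussian_kl_per_lead_time` are all assumptions made for this example; the paper's own estimator (its approximate-KL equation) is not reproduced in this excerpt and may differ.

```python
import numpy as np


def gaussian_kl_per_lead_time(gen, cts, eps=1e-8):
    """KL(N_gen || N_cts) per lead time under a moment-matched Gaussian fit.

    gen, cts: arrays of shape (members, lead_times) for one scalar variable.
    Illustrative stand-in only; not the paper's estimator.
    """
    mu_g, mu_c = gen.mean(axis=0), cts.mean(axis=0)
    # eps guards against zero variance in a collapsed ensemble.
    var_g = gen.var(axis=0, ddof=1) + eps
    var_c = cts.var(axis=0, ddof=1) + eps
    # Closed-form KL divergence between two univariate Gaussians.
    return 0.5 * (np.log(var_c / var_g)
                  + (var_g + (mu_g - mu_c) ** 2) / var_c
                  - 1.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cts = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
    # Hypothetical generative ensemble whose mean drifts away with lead time.
    gen = rng.normal(loc=[0.0, 1.0, 3.0], scale=1.0, size=(500, 3))
    print(gaussian_kl_per_lead_time(gen, cts))
```

In this toy setup the value grows with lead time because the generative ensemble's mean drifts away from the CTS ensemble while the spreads stay equal.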