Am I Confused or Is This Confusing?: Deep Ensembles for ENSO Uncertainty Quantification

Devin M. McAfee, Elizabeth A. Barnes

TL;DR

The paper tackles uncertainty quantification in climate predictions of ENSO under climate-change–driven covariate shift. It adopts large deep ensembles of probabilistic networks on CESM2-LE data, explicitly disentangling aleatoric and epistemic uncertainty with AU and EU definitions, including $\text{AU}(\mathbf{x}) = -\frac{1}{M}\sum_{i=1}^{M} \sum_{k=1}^{K} p_{\mathbf{w}_i}(y=k\mid \mathbf{x}) \log p_{\mathbf{w}_i}(y=k\mid \mathbf{x})$ and $\text{EU}(\mathbf{x}) = \frac{1}{M} \sum_{k=1}^{K} \sum_{i=1}^{M} (p_{\mathbf{w}_i}(y=k\mid \mathbf{x}) - p_{\text{ens}}(y=k\mid \mathbf{x}))^2$. The findings show that epistemic uncertainty robustly signals predictive error growth under warming scenarios, while aleatoric uncertainty becomes unreliable as the input distribution shifts; ensemble improvement scales with EU and increases with distributional shift, and temperature scaling can correct calibration biases to recover short-lead performance. These results support using deep ensembles for robust, interpretable UQ in climate prediction and highlight the need to account for epistemic uncertainty when forecasting under nonstationary climates.
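The AU and EU definitions above can be evaluated directly from the stacked component predictions. The following is a minimal sketch, not the authors' code: the function name `uncertainty_decomposition`, the `(M, K)` array layout, and the `eps` guard against `log(0)` are illustrative assumptions, and the natural logarithm is assumed for the entropy.

```python
import numpy as np

def uncertainty_decomposition(probs, eps=1e-12):
    """Return (p_ens, AU, EU) for one input x.

    probs: array of shape (M, K); row i is the categorical distribution
           p_{w_i}(y | x) predicted by ensemble component i.
    eps:   small constant (an assumption of this sketch) to avoid log(0).
    """
    probs = np.asarray(probs, dtype=float)
    # Ensemble prediction: average of the component distributions.
    p_ens = probs.mean(axis=0)
    # AU: Shannon entropy of each component, averaged over the M components.
    au = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # EU: squared deviation from the ensemble mean, summed over the K classes
    # and averaged over the M components.
    eu = np.mean(np.sum((probs - p_ens) ** 2, axis=1))
    return p_ens, au, eu

# Three components that agree exactly: EU is zero, AU is the shared entropy.
agree = np.array([[0.7, 0.2, 0.1]] * 3)
p_agree, au_agree, eu_agree = uncertainty_decomposition(agree)

# Three confident components that disagree: low AU, high EU.
disagree = np.array([[0.90, 0.05, 0.05],
                     [0.05, 0.90, 0.05],
                     [0.05, 0.05, 0.90]])
p_dis, au_dis, eu_dis = uncertainty_decomposition(disagree)
```

In the second case each component is individually confident, so AU is small even though the ensemble mean is uniform; only EU registers the disagreement.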

Abstract

Faithful uncertainty quantification (UQ) is paramount in high-stakes climate prediction. Deep ensembles, or ensembles of probabilistic neural networks, are state of the art for UQ in machine learning (ML) and are growing increasingly popular for weather and climate prediction. However, detailed analyses of the mechanisms, strengths, and limitations of ensembles in these complex problem settings are lacking. We take a step towards filling this gap by deploying deep ensembles for predictability analysis of the El Niño-Southern Oscillation (ENSO) in the Community Earth System Model 2 Large Ensemble (CESM2-LE). Principally, we show that epistemic uncertainty, modeled by ensemble disagreement, robustly signals predictive error growth associated with shifts in the distributions of monthly sea-surface temperature (SST), ocean heat content (OHC), and zonal surface wind stress ($\tau_x$) anomalies under a climate change scenario. Conversely, we find that aleatoric uncertainty, which remains a popular measure of model confidence, becomes less reliable and behaves counterintuitively under climate-change-induced distributional shift. We highlight that, because ensemble performance improvement relative to the expected single model scales with epistemic uncertainty, ensemble improvement increases with distributional shift from climate change. This work demonstrates the utility of deep ensembles for modeling aleatoric and epistemic uncertainty in ML climate prediction, as well as the growing importance of robustly quantifying these two forms of uncertainty under anthropogenic warming.
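One way to see why ensemble improvement over the expected single model tracks ensemble disagreement is the standard ambiguity decomposition for squared-error scores. The sketch below assumes the multiclass Brier score as the performance metric, which this excerpt does not confirm as the paper's choice; the random logits are synthetic stand-ins for component predictions. Under that assumption, the mean component score minus the ensemble score equals the variance-based disagreement term exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 10, 3  # ensemble size and number of classes (illustrative values)

# Synthetic component predictions: softmax of random logits, shape (M, K).
logits = rng.normal(size=(M, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# One-hot target for an arbitrarily chosen true class.
y = np.eye(K)[1]

def brier(p, target):
    """Multiclass Brier score: squared error summed over classes."""
    return np.sum((p - target) ** 2, axis=-1)

p_ens = probs.mean(axis=0)

mean_component_brier = brier(probs, y).mean()  # expected single-model score
ensemble_brier = brier(p_ens, y)               # score of the averaged forecast

# Disagreement term: variance of component probabilities about the ensemble
# mean, summed over classes and averaged over components.
eu = np.mean(np.sum((probs - p_ens) ** 2, axis=1))

improvement = mean_component_brier - ensemble_brier
```

Because the identity holds for any target `y`, the improvement is never negative under this score, and it grows whenever the components spread further apart.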

Paper Structure

This paper contains 17 sections, 21 equations, 32 figures.

Figures (32)

  • Figure 1: Schematic of the deep ensemble framework given a multimodal posterior for a three-class classification problem (probability masses denoted by colored bars). Each component is a potentially unique mode of $p(\mathbf{w} \mid \mathcal{D})$, which produces potentially unique functional representations of the data.
  • Figure 2: Illustration of the study's problem setup using an example prediction from a component of $\textbf{premodern}$ initialized in January. A ResNet-18 model ingests the last three months of anomalies (in units of standard deviations) and outputs categorical distributions over ENSO classes for the next 24 months. The highlighted distributions are for leads 1, 3, 6, $\dots$ and 24 months, and the faint distributions are for the remaining leads.
  • Figure 3: (a-c) Performance and (d) EU for the period [1850, 1949], averaged across leads, as a function of ensemble size for $\textbf{premodern}$, $\textbf{modern}$, and $\textbf{scenario}$. Shadings cover the range of scores from random subsampling of ensemble components without replacement, and curves represent the mean scores across subsamples.
  • Figure 4: Mean testing scores for ensemble components (histograms) and deep ensembles (stars) over the premodern period.
  • Figure 5: Difference in performance between ensemble and best performing component on the testing set over the premodern period.
  • ...and 27 more figures