Distilling Machine Learning's Added Value: Pareto Fronts in Atmospheric Applications

Tom Beucler, Arthur Grundner, Sara Shamekh, Peter Ukkonen, Matthew Chantry, Ryan Lagerquist

TL;DR

The paper introduces a framework of Pareto-optimal model hierarchies that quantifies the added value of machine learning in atmospheric science by weighing predictive error against model complexity. It defines four categories of added value (functional representation, feature assimilation, spatial connectivity, and temporal connectivity) and demonstrates how these categories guide model development. Through three atmospheric case studies (cloud cover parameterization, shortwave radiative transfer emulation, and tropical precipitation), the authors show how symbolic regression, physics-guided architectures, and temporal memory can yield interpretable, efficient models that rival or approach complex deep learning systems. The work emphasizes interpretability and trust, proposing a structured approach for extracting scientific insight and building scalable, trustworthy ML tools for weather and climate applications.

Abstract

The added value of machine learning for weather and climate applications is measurable through performance metrics, but explaining it remains challenging, particularly for large deep learning models. Inspired by climate model hierarchies, we propose that a full hierarchy of Pareto-optimal models, defined within an appropriately determined error-complexity plane, can guide model development and help understand the models' added value. We demonstrate the use of Pareto fronts in atmospheric physics through three sample applications, with hierarchies ranging from semi-empirical models with minimal parameters to deep learning algorithms. First, in cloud cover parameterization, we find that neural networks identify nonlinear relationships between cloud cover and its thermodynamic environment, and assimilate previously neglected features such as vertical gradients in relative humidity that improve the representation of low cloud cover. This added value is condensed into a ten-parameter equation that rivals deep learning models. Second, we establish a machine learning model hierarchy for emulating shortwave radiative transfer, distilling the importance of bidirectional vertical connectivity for accurately representing absorption and scattering, especially for multiple cloud layers. Third, we emphasize the importance of convective organization information when modeling the relationship between tropical precipitation and its surrounding environment. We discuss the added value of temporal memory when high-resolution spatial information is unavailable, with implications for precipitation parameterization. Therefore, by comparing data-driven models directly with existing schemes using Pareto optimality, we promote process understanding by hierarchically unveiling system complexity, with the hope of improving the trustworthiness of machine learning models in atmospheric applications.
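
As a minimal illustration of what "Pareto-optimal within an error-complexity plane" means, the sketch below filters a list of candidate models down to its Pareto front. It is not the authors' code: the `Model` tuple, the `pareto_front` helper, the choice of parameter count as the complexity metric, and all numbers in `MODELS` are hypothetical placeholders chosen only to make the example runnable.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    complexity: float  # assumed metric, e.g. number of trainable parameters
    error: float       # assumed metric, e.g. validation mean squared error


def pareto_front(models: List[Model]) -> List[Model]:
    """Return the models that no other model dominates.

    Model A dominates model B if A is no worse than B in both complexity
    and error, and strictly better in at least one of the two.
    """
    front = []
    best_error = float("inf")
    # Sweep from simplest to most complex. A model joins the front only if
    # it strictly lowers the best error reached at equal or lower complexity.
    # Exact duplicates (same complexity and error) are kept once.
    for m in sorted(models, key=lambda m: (m.complexity, m.error)):
        if m.error < best_error:
            front.append(m)
            best_error = m.error
    return front


# Hypothetical candidates; the values are invented for illustration only.
MODELS = [
    Model("a", 2, 0.40),
    Model("b", 10, 0.25),
    Model("c", 10, 0.30),     # same complexity as b, higher error
    Model("d", 100, 0.28),    # more complex than b, higher error
    Model("e", 1000, 0.10),
    Model("f", 5000, 0.10),   # more complex than e, same error
]

if __name__ == "__main__":
    for m in pareto_front(MODELS):
        print(f"{m.name}: complexity={m.complexity}, error={m.error}")
```

In this toy set, models c, d, and f are dominated and drop out, leaving a front that runs from the simplest model to the lowest-error one. How complexity and error are measured is a modeling decision in its own right, which is why the abstract speaks of an "appropriately determined" plane.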

Paper Structure

This paper contains 27 sections, 37 equations, 4 figures.

Figures (4)

  • Figure 1: Exploring Pareto fronts (sets of Pareto-optimal models) within a complexity-error plane highlights machine learning's added value. Crosses in step 1 denote existing models. Algorithms such as deep learning allow for the creation of efficient, low-error, albeit complex models (step 2). Knowledge distillation, through methods such as equation discovery, aims to explain error reduction, resulting in simpler, low-error models (step 3) and long-lasting scientific progress. For atmospheric applications, we propose four categories to classify this added value: functional representation, feature assimilation, spatial connectivity, and temporal connectivity.
  • Figure 2: Pareto-optimal model hierarchies quantify the added value of machine learning for cloud cover parameterization. Machine learning better captures the relationship between cloud cover and its thermodynamic environment and assimilates features like vertical humidity gradients. (Left) We progressively improve traditional baselines via polynomial regression (red, orange, and yellow crosses), significantly decrease error using neural networks (pink and purple crosses), and finally distill the added value of these neural networks symbolically (green crosses). (Right) Both the neural network (orange line) and its distilled symbolic representation (green line) better represent the functional relationship between cloud cover and its environment, aligning more closely across temperatures with the reference storm-resolving simulation (blue dots) than the Sundqvist scheme (red line) used in the ICON Earth system model. "Cold" and "Hot" refer to the validation set's first and last temperature octiles. Additionally, machine learning models assimilate multiple features absent in existing baselines, including vertical humidity gradients. The smaller discrepancy between the 5-feature scheme ('SFS5') and the reference ('REF'), compared to the 4-feature scheme ('SFS4'), demonstrates improved representation of the time-averaged low cloud cover in regions such as the Southeast Pacific, thereby reducing biases in current cloud cover schemes that plague the global radiative budget.
  • Figure 3: Pareto-optimal model hierarchies guide the development of progressively tailored architectures for emulating shortwave radiative transfer. Panel (a) shows error vs. complexity on a logarithmic scale for the simple clear-sky cases dominated by absorption; panel (b) shows error vs. complexity for cases with multi-layer cloud, including both liquid and ice, where multiple scattering complicates radiative transfer. Convolutional neural networks (CNN; red crosses) with small kernels, multilayer perceptrons (MLP; orange crosses) that ignore the vertical dimension, and the simple linear baseline (light pink star) give credible results in the clear-sky case. However, they fail in the more complex case, which requires U-net architectures (dark pink and purple crosses) to fully capture non-local radiative transfer. The vertical invariance of the two-stream radiative transfer equations suggests a bidirectional recurrent neural network (RNN; green star) architecture, which rivals the skill of U-nets with a fraction of their trainable parameters.
  • Figure 4: Pareto-optimal model hierarchies underscore the importance of storm-resolving information in elucidating the relationship between precipitation and its surrounding environment, while also quantifying the recoverability of this information from the coarse environment's time series. (Left) Neural networks (NN) leveraging high-resolution spatial data (purple crosses) clearly outperform NNs that use only coarse inputs (orange crosses). However, this performance gap is largely mitigated when the coarse inputs' past time steps are included (green crosses). (Right) Processing the precipitable water field at a resolution of $\Delta x \approx 5$ km yields coefficients of determination $R^2 \approx 0.9$, clearly surpassing the $R^2 \approx 0.5$ attained by our best NN using fields at the coarse $\Delta x \approx 10^2$ km horizontal resolution. This performance gap is partially closed by incorporating two past time steps along with the current time step, resulting in $R^2 \approx 0.7$. This suggests a partial equivalence of the environment's spatial and temporal connectivities in predicting precipitation.
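
The bidirectional recurrence mentioned in the Figure 3 caption can be sketched in a few lines of NumPy. This is a hedged illustration of the general technique, not the architecture used in the paper: the feature count, hidden size, `tanh` activation, random untrained weights, and the helper names `init_params`, `sweep`, `birnn_forward`, and `count_params` are all assumptions. It shows two properties the caption relies on: every level's output can depend on levels both above and below it, and the same weights are reused at every level, so the parameter count does not grow with the number of vertical levels.

```python
import numpy as np


def init_params(n_features, n_hidden, rng, scale=0.5):
    """Random, untrained weights for two recurrent cells plus a linear readout."""
    def cell():
        return {
            "W_x": rng.normal(0.0, scale, (n_features, n_hidden)),
            "W_h": rng.normal(0.0, scale, (n_hidden, n_hidden)),
            "b": np.zeros(n_hidden),
        }
    return {
        "down": cell(),  # sweeps from the first level to the last
        "up": cell(),    # sweeps from the last level to the first
        "W_out": rng.normal(0.0, scale, (2 * n_hidden, 1)),
        "b_out": np.zeros(1),
    }


def sweep(x, cell, reverse=False):
    """Run one recurrent cell along the vertical axis of x (n_levels, n_features)."""
    n_levels = x.shape[0]
    h = np.zeros(cell["b"].shape[0])
    states = np.zeros((n_levels, h.shape[0]))
    order = range(n_levels - 1, -1, -1) if reverse else range(n_levels)
    for k in order:
        # The same weights are applied at every level.
        h = np.tanh(x[k] @ cell["W_x"] + h @ cell["W_h"] + cell["b"])
        states[k] = h
    return states


def birnn_forward(x, params):
    """Return one scalar output per level, combining both sweep directions."""
    down = sweep(x, params["down"])
    up = sweep(x, params["up"], reverse=True)
    hidden = np.concatenate([down, up], axis=1)
    return (hidden @ params["W_out"] + params["b_out"])[:, 0]


def count_params(params):
    """Total number of scalar weights, independent of the number of levels."""
    total = 0
    for value in params.values():
        if isinstance(value, dict):
            total += sum(a.size for a in value.values())
        else:
            total += value.size
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = init_params(n_features=3, n_hidden=8, rng=rng)
    column = rng.normal(size=(6, 3))  # hypothetical 6-level column, 3 features
    print(birnn_forward(column, params).shape, count_params(params))
```

Because the weights are shared along the vertical, the same `params` can be applied to columns with any number of levels, which is one way a recurrent design can stay far smaller than an architecture whose parameter count scales with column depth.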