SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth

Nick Masi, Randall Balestriero

TL;DR

Globally averaged evaluation metrics obscure geospatial and socio-economic disparities in AI weather forecasts. This work introduces SAFE, an open-source framework that stratifies forecast performance by territory, global subregion, income, and landcover, using accurate area weighting on an oblate Earth to compute $\mathrm{RMSE}$ and fairness metrics. Benchmarking six state-of-the-art AI weather prediction models against ERA5 data (2020) for $T_{850}$ and $Z_{500}$ at lead times up to 10 days, the study reveals persistent disparities that grow with lead time, and shows that some models (e.g., FuXi) are consistently more fair than others. The results underscore the value of stratified fairness analyses for deployment decisions and model development. SAFE provides a practical tool for location-aware forecasting and accountability in weather prediction systems; the code is openly available at https://github.com/N-Masi/safe.

Abstract

The dominant paradigm in machine learning is to assess model performance based on average loss across all samples in some test set. In weather and climate settings, this amounts to averaging performance geospatially across the Earth, which fails to account for the non-uniform distribution of human development and geography. We introduce Stratified Assessments of Forecasts over Earth (SAFE), a package for elucidating the stratified performance of a set of predictions made over Earth. SAFE integrates various data domains to stratify by different attributes associated with geospatial gridpoints: territory (usually country), global subregion, income, and landcover (land or water). This allows us to examine the performance of models for each individual stratum of the different attributes (e.g., the accuracy in every individual country). To demonstrate its importance, we use SAFE to benchmark a zoo of state-of-the-art AI-based weather prediction models, finding that they all exhibit disparities in forecasting skill across every attribute. We use this to seed a benchmark of model forecast fairness through stratification at different lead times for various climatic variables. By moving beyond globally averaged metrics, we ask, for the first time: where do models perform best or worst, and which models are most fair? To support further work in this direction, the SAFE package is open source and available at https://github.com/N-Masi/safe.
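To make the idea of stratified, area-weighted evaluation concrete, the following is a minimal sketch in NumPy. It is not the SAFE package's API: the function names `cell_area_weights` and `stratified_rmse` are hypothetical, the grid and stratum labels are toy values, and the area weights assume a spherical Earth as a simplification, whereas SAFE is described as weighting on an oblate Earth.

```python
import numpy as np


def cell_area_weights(lats_deg, dlat_deg):
    """Relative area of each latitude band on a SPHERE (simplifying assumption).

    The area of a band between two latitudes is proportional to the
    difference of the sines of its bounding latitudes.
    """
    upper = np.clip(lats_deg + dlat_deg / 2.0, -90.0, 90.0)
    lower = np.clip(lats_deg - dlat_deg / 2.0, -90.0, 90.0)
    return np.sin(np.deg2rad(upper)) - np.sin(np.deg2rad(lower))


def stratified_rmse(pred, truth, strata, lat_weights):
    """Area-weighted RMSE for each stratum label.

    pred, truth : arrays of shape (lat, lon)
    strata      : integer labels of shape (lat, lon), e.g. income group
    lat_weights : per-latitude area weights of shape (lat,)
    """
    sq_err = (pred - truth) ** 2
    weights = np.broadcast_to(lat_weights[:, None], sq_err.shape)
    result = {}
    for label in np.unique(strata):
        mask = strata == label
        # Weighted mean of squared error over the stratum's cells, then root.
        mse = np.average(sq_err[mask], weights=weights[mask])
        result[int(label)] = float(np.sqrt(mse))
    return result


# Toy 3x2 grid with two hypothetical strata (0 and 1).
lats = np.array([-60.0, 0.0, 60.0])
weights = cell_area_weights(lats, dlat_deg=60.0)
pred = np.zeros((3, 2))
truth = np.array([[2.0, 2.0], [0.0, 0.0], [0.0, 4.0]])
strata = np.array([[0, 0], [0, 1], [1, 1]])
print(stratified_rmse(pred, truth, strata, weights))
```

A single global average over this toy grid would blend the two strata together; reporting one value per label is what exposes the gap between them.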

Paper Structure

This paper contains 24 sections, 1 equation, 9 figures, 9 tables.

Figures (9)

  • Figure 1: GraphCast displays non-uniform error in temperature prediction. The temporally averaged, gridpoint-specific RMSE of temperature predictions at 850 hPa (T850) made by GraphCast for every 12 hours in 2020 is shown. Predictions are made with a 3-day lead time, meaning they predict the temperature 72 hours after the input conditions. Lower RMSE is better. GraphCast inference predictions are from WeatherBench 2; ground truth temperature values are from ECMWF ERA5. Spatial resolution is 1.5 degrees.
  • Figure 2: Greatest absolute difference of any two per-strata RMSE for each attribute when predicting T850 and Z500 at different lead times. Lower difference is more fair. Starting at a lead time of about one week, FuXi is the most fair model across all attributes and variables.
  • Figure 3: Variance of all the per-strata RMSE for each attribute when predicting T850 and Z500 at different lead times. Lower variance is more fair.
  • Figure 4: Per-strata RMSE for the income attribute of each model. This captures how well models perform at predicting each climatic variable stratified by the income classification for the associated country. We see that a bias against high-income countries grows with lead time.
  • Figure 5: Per-strata RMSE for the landcover attribute of each model. This captures how well models perform at predicting each climatic variable stratified by the prediction being over land or water (oceans, seas, and many large lakes).
  • ...and 4 more figures
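
The two fairness summaries named in the captions of Figures 2 and 3 (greatest absolute difference between any two per-strata RMSE values, and variance of the per-strata RMSE values) can be sketched as follows. This is an illustrative sketch, not code from the SAFE package: the function names are hypothetical, the RMSE values are made-up placeholders rather than results from the paper, and the use of population variance (`ddof=0`) is an assumption, since the captions do not say which estimator is used.

```python
import numpy as np


def greatest_absolute_difference(per_stratum_rmse):
    """Largest |RMSE_a - RMSE_b| over all pairs of strata.

    For a finite set of values this equals max minus min.
    """
    values = np.asarray(list(per_stratum_rmse.values()), dtype=float)
    return float(values.max() - values.min())


def rmse_variance(per_stratum_rmse):
    """Variance of the per-stratum RMSE values (population variance, assumed)."""
    values = np.asarray(list(per_stratum_rmse.values()), dtype=float)
    return float(values.var(ddof=0))


# Placeholder values for illustration only; NOT results from the paper.
example_rmse = {
    "stratum_a": 1.0,
    "stratum_b": 1.5,
    "stratum_c": 2.0,
    "stratum_d": 2.5,
}

print(greatest_absolute_difference(example_rmse))
print(rmse_variance(example_rmse))
```

Under both summaries a lower value indicates a fairer model, matching the captions. The greatest difference is driven only by the best and worst strata, while the variance reflects the spread across all of them.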