SAFE: A Novel Approach to AI Weather Evaluation through Stratified Assessments of Forecasts over Earth
Nick Masi, Randall Balestriero
TL;DR
This work addresses the problem that globally averaged evaluation metrics obscure geospatial and socio-economic disparities in AI weather forecasts. It introduces SAFE, an open-source framework that stratifies forecast performance by territory, global subregion, income, and landcover, using accurate area weighting on an oblate Earth to compute $\mathrm{RMSE}$ and fairness metrics. By benchmarking six state-of-the-art AI weather prediction models on ERA5 data (2020) for $T_{850}$ and $Z_{500}$ at lead times up to 10 days, the study reveals persistent disparities that grow with lead time and shows that some models (e.g., FuXi) are consistently fairer than others. The results underscore the value of stratified fairness analyses for deployment decisions and model development. SAFE provides a practical tool to enable location-aware forecasting and accountability in weather prediction systems; the code is openly available at https://github.com/N-Masi/safe.
Abstract
The dominant paradigm in machine learning is to assess model performance by the average loss across all samples in a test set. In weather and climate settings, this amounts to averaging performance geospatially across the Earth, which fails to account for the non-uniform distribution of human development and geography. We introduce Stratified Assessments of Forecasts over Earth (SAFE), a package for elucidating the stratified performance of a set of predictions made over Earth. SAFE integrates various data domains to stratify by different attributes associated with geospatial gridpoints: territory (usually country), global subregion, income, and landcover (land or water). This allows us to examine the performance of models for each individual stratum of each attribute (e.g., the accuracy in every individual country). To demonstrate its importance, we use SAFE to benchmark a zoo of state-of-the-art AI-based weather prediction models, finding that they all exhibit disparities in forecasting skill across every attribute. We use this to seed a benchmark of model forecast fairness through stratification at different lead times for various climatic variables. By moving beyond globally averaged metrics, we ask, for the first time: where do models perform best or worst, and which models are most fair? To support further work in this direction, the SAFE package is open source and available at https://github.com/N-Masi/safe.
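To make the idea of stratified evaluation concrete, the following is a minimal sketch of an area-weighted RMSE computed separately for each stratum of a gridded attribute. It is not the SAFE API: the function name `stratified_rmse`, its arguments, and the toy data are hypothetical, and the weights use a simple spherical cosine-of-latitude approximation rather than the oblate-Earth area weighting described above.

```python
import numpy as np


def stratified_rmse(pred, truth, lats, strata):
    """Area-weighted RMSE per stratum (hypothetical helper, not part of SAFE).

    pred, truth : arrays of shape (n_lat, n_lon) on a regular lat-lon grid
    lats        : array of shape (n_lat,), gridpoint latitudes in degrees
    strata      : array of shape (n_lat, n_lon), one label per gridpoint
                  (e.g., a country code or an income group)
    """
    # Assumption: spherical Earth, so cell area is proportional to cos(latitude).
    weights = np.cos(np.deg2rad(lats))[:, None] * np.ones(pred.shape)
    sq_err = (pred - truth) ** 2

    scores = {}
    for label in np.unique(strata):
        mask = strata == label
        # Weighted mean of squared error over the gridpoints in this stratum.
        mse = np.sum(weights[mask] * sq_err[mask]) / np.sum(weights[mask])
        scores[label.item()] = float(np.sqrt(mse))
    return scores


# Toy example: two strata, with a larger error everywhere in stratum 1.
lats = np.array([-60.0, 0.0, 60.0])
strata = np.array([[0, 0, 1, 1]] * 3)
truth = np.zeros((3, 4))
pred = np.where(strata == 0, 1.0, 2.0)

scores = stratified_rmse(pred, truth, lats, strata)
print(scores)
```

In this toy case a single global average would blend the two strata into one number, whereas the per-stratum scores show that the error is twice as large in stratum 1 as in stratum 0.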
