
Forecasting the Future with Yesterday's Climate: Temperature Bias in AI Weather and Climate Models

Jacob B. Landsberg, Elizabeth A. Barnes

TL;DR

This study probes why AI weather and climate models trained on historical data struggle to predict future climates. Comparing boreal winter land temperatures from two weather models (FourCastNet V2 Small, Pangu Weather) and one climate model (ACE2) against ERA5 for periods outside their training data, the authors find systematic cold biases: predictions resemble climates from 15-20 years earlier in the global mean, and from as much as 20-30 years earlier in some regions such as the Eastern U.S. The weather models show the strongest biases in the hottest forecasts, while ACE2 biases are greatest in the coldest forecasts, aligning with regional warming trends and training distributions. The findings highlight extrapolation limitations in data-driven models and advocate for training-data augmentation and climate-robust design to mitigate these biases in future climate prediction.

Abstract

AI-based climate and weather models have rapidly gained popularity, providing faster forecasts with skill that can match or even surpass that of traditional dynamical models. Despite this success, these models face a key challenge: predicting future climates while being trained only with historical data. In this study, we investigate this issue by analyzing boreal winter land temperature biases in AI weather and climate models. We examine two weather models, FourCastNet V2 Small (FourCastNet) and Pangu Weather (Pangu), evaluating their predictions for 2020-2025, and one climate model, Ai2 Climate Emulator version 2 (ACE2), evaluating its predictions for 1996-2010. These time periods lie outside of the respective models' training sets and are significantly more recent than the bulk of their training data, allowing us to assess how well the models generalize to new, i.e., more modern, conditions. We find that all three models produce cold-biased mean temperatures, resembling climates from 15-20 years earlier than the period they are predicting. In some regions, like the Eastern U.S., the predictions resemble climates from as much as 20-30 years earlier. Further analysis shows that FourCastNet's and Pangu's cold bias is strongest in the hottest predicted temperatures, indicating limited training exposure to modern extreme heat events. In contrast, ACE2's bias is more evenly distributed but largest in regions, seasons, and parts of the temperature distribution where climate change has been most pronounced. These findings underscore the challenge of training AI models exclusively on historical data and highlight the need to account for such biases when applying them to future climate prediction.
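The two quantities the abstract relies on, a mean temperature bias and the earlier period that a prediction most resembles, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mean_bias`, `closest_matching_span`), the single synthetic grid point, the 0.03 K/yr warming trend, and the 0.45 K cold offset are all assumptions chosen only to make the idea concrete.

```python
from statistics import fmean


def mean_bias(forecast, reference):
    """Mean of (forecast - reference); negative values indicate a cold bias."""
    return fmean(f - r for f, r in zip(forecast, reference))


def closest_matching_span(yearly_reference, forecast_mean, span):
    """Return (start_year, end_year) of the `span`-year window whose mean
    reference temperature is closest to `forecast_mean`.

    Assumes the keys of `yearly_reference` are consecutive years.
    """
    years = sorted(yearly_reference)
    best = None
    for i in range(len(years) - span + 1):
        window = years[i:i + span]
        diff = abs(fmean(yearly_reference[y] for y in window) - forecast_mean)
        if best is None or diff < best[0]:
            best = (diff, window[0], window[-1])
    return best[1], best[2]


# Synthetic winter-mean temperatures at one grid point (hypothetical data):
# a steady warming of 0.03 K per year starting from 270 K in 1980.
reference = {year: 270.0 + 0.03 * (year - 1980) for year in range(1980, 2025)}

target_years = range(2020, 2025)
truth = [reference[y] for y in target_years]
forecast = [t - 0.45 for t in truth]  # hypothetical forecasts, 0.45 K too cold

bias = mean_bias(forecast, truth)
matched = closest_matching_span(reference, fmean(forecast), span=5)
print(f"mean bias: {bias:.2f} K, closest matching span: {matched}")
```

In this toy setup, a 0.45 K cold bias under a 0.03 K/yr trend maps the 2020-2024 forecasts onto a window centered 15 years earlier, which shows how a modest bias can translate into a large apparent time lag.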

Paper Structure

This paper contains 11 sections, 5 figures.

Figures (5)

  • Figure 1: Mean 2mT differences for 2020-2025 boreal winter land temperatures compared to ERA5 for (a) FourCastNet 2-day lead, (b) FourCastNet 9-day lead, (c) Pangu 2-day lead, and (d) Pangu 9-day lead. Global means are shown at the bottom of each panel, with stippling indicating statistically significant non-zero bias (see Methods).
  • Figure 2: The closest matching 5-year span of ERA5 land temperatures to FourCastNet and Pangu's 9-day lead forecasts of 2020-2025 boreal winter land temperatures for a) a 9-day persistence forecast, b) FourCastNet 2-day prediction, c) FourCastNet 9-day prediction, d) Pangu 2-day prediction, and e) Pangu 9-day prediction. The Eastern U.S. (highlighted by the black box) and global mean time period are shown in the legend. Stippling indicates grid points that have statistically significant non-zero bias.
  • Figure 3: Mean 2mT differences as in Figure 1 but for the 10th and 90th percentiles of FourCastNet's (a, b) and Pangu's (c, d) 9-day lead forecasts, with stippling indicating statistically significant non-zero bias. An example of the tail behavior for the SE U.S. (bounded by the yellow box in a-d) is shown in e. The global mean percent of training data as or more extreme than the 10th and 90th percentiles of 2020-2025 ERA5 temperatures is displayed in f.
  • Figure 4: a) Mean surface temperature differences for 1996-2010 boreal winter land temperatures compared to ERA5 for ACE2. b) The closest matching 15-year span of ERA5 land temperatures to ACE2's 1996-2010 boreal winter land temperatures. The Eastern U.S. (highlighted by the black box) and global mean time period are shown in the legend. c) Mean surface temperature differences as in (a) but for the 10th percentile of ACE2's 1996-2010 predictions. d) Mean surface temperature differences as in (a) but for the 90th percentile of ACE2's 1996-2010 predictions. Global means are shown at the bottom of (a), (c), and (d), while all figures show stippling at grid points of statistically significant non-zero bias.
  • Figure 5: a) Change in boreal winter land surface temperatures between 1940-1979 and 1980-2022 relative to the annual mean change. b) Change in the 10th vs. 90th percentile of boreal winter land surface temperatures between 1940-1979 and 1980-2022. c) Mean surface temperature differences for 1996-2010 boreal summer land temperatures compared to ERA5 for ACE2. Stippling indicates statistically significant non-zero bias. d) Change in boreal summer land surface temperatures between 1940-1979 and 1980-2022. Global means are shown at the bottom of each panel.
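
The tail diagnostics named in the Figure 3 caption (bias at the 10th and 90th percentiles, and the share of training data as or more extreme than a recent percentile) can be sketched as follows. This is one plausible reading of the caption, not the paper's method: the helper names, the integer toy temperatures, the 2-degree warming shift, and the clipped "forecast" are all hypothetical.

```python
def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def tail_bias(forecast, reference, q):
    """Difference between the q-th percentiles of forecast and reference."""
    return percentile(forecast, q) - percentile(reference, q)


def fraction_as_extreme(training, threshold, upper_tail):
    """Share of training samples at or beyond `threshold` in the chosen tail."""
    if upper_tail:
        hits = sum(1 for t in training if t >= threshold)
    else:
        hits = sum(1 for t in training if t <= threshold)
    return hits / len(training)


# Hypothetical temperatures (degrees C) at one grid point.
training = list(range(-20, 21))          # older climate seen during training
recent = [t + 2 for t in training]       # recent climate, uniformly 2 degrees warmer
forecast = [min(t, 17) for t in recent]  # toy model that cannot exceed 17 degrees

cold_tail_bias = tail_bias(forecast, recent, 10)
warm_tail_bias = tail_bias(forecast, recent, 90)
warm_exposure = fraction_as_extreme(training, percentile(recent, 90), upper_tail=True)
cold_exposure = fraction_as_extreme(training, percentile(recent, 10), upper_tail=False)
print(cold_tail_bias, warm_tail_bias, round(warm_exposure, 3), round(cold_exposure, 3))
```

In this toy case the bias appears only in the warm tail, and the training set contains fewer samples at or above the recent 90th percentile than at or below the recent 10th percentile, which mirrors the kind of asymmetry the caption describes for the weather models.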