How far are today's time-series models from real-world weather forecasting applications?

Tao Han, Song Guo, Zhenghao Chen, Wanghan Xu, Lei Bai

TL;DR

This paper addresses the gap between research-based time-series forecasting (TSF) and real-world weather forecasting by introducing WEATHER-5K, a large-scale global station dataset with 5,672 hourly stations over 2014–2023, complemented by rigorous quality control, post-processing, and extreme-event evaluation. It provides a standardized benchmark comparing TSF methods with operational NWP models, using metrics like MAE, MSE, and SEDI for extremes, across multiple forecast horizons. Key findings show that while NWP models outperform TSF at longer lead times, some TSF approaches can match short-term forecasts, yet TSF generally lags in extreme-weather prediction and long-range accuracy; larger models do not guarantee better performance, and integrating NWP outputs as priors or bias corrections can improve TSF-based forecasts. The WEATHER-5K dataset and benchmark framework offer a reproducible platform to drive next-generation TSF methods toward more accurate, scalable real-world weather forecasting and decision support.
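
The metrics named above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: MAE and MSE follow their usual definitions, and SEDI uses the standard forecast-verification formula based on the hit rate H and false-alarm rate F. The function names, the "value above a fixed threshold" event definition, and the sample numbers are assumptions made for this example; the paper's exact thresholds may differ.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error over paired observations and forecasts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error over paired observations and forecasts."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def sedi(y_true, y_pred, threshold):
    """Symmetric Extremal Dependence Index for events 'value > threshold'.

    Assumption: an extreme event is any value above a fixed threshold.
    Returns NaN when H or F is exactly 0 or 1, where the logs are undefined.
    """
    hits = misses = false_alarms = correct_negatives = 0
    for t, p in zip(y_true, y_pred):
        observed, forecast = t > threshold, p > threshold
        if observed and forecast:
            hits += 1
        elif observed:
            misses += 1
        elif forecast:
            false_alarms += 1
        else:
            correct_negatives += 1

    if hits + misses == 0 or false_alarms + correct_negatives == 0:
        return float("nan")
    H = hits / (hits + misses)                              # hit rate
    F = false_alarms / (false_alarms + correct_negatives)   # false-alarm rate
    if not (0 < H < 1 and 0 < F < 1):
        return float("nan")

    num = math.log(F) - math.log(H) - math.log(1 - F) + math.log(1 - H)
    den = math.log(F) + math.log(H) + math.log(1 - F) + math.log(1 - H)
    return num / den


# Hypothetical 2m temperatures (degrees C) at one station; 32 is an illustrative threshold.
obs = [30, 35, 20, 22, 36, 18, 25, 33]
fc = [31, 28, 21, 34, 37, 19, 24, 35]
print(mae(obs, fc), mse(obs, fc), sedi(obs, fc, threshold=32))
```

SEDI ranges from -1 to 1, with 1 for a perfect forecast of the extreme events and 0 for a forecast with no skill, which is why it complements MAE and MSE: those average over all time steps and can hide poor performance on rare events.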

Abstract

The development of Time-Series Forecasting (TSF) techniques is often hindered by the lack of comprehensive datasets. This is particularly problematic for time-series weather forecasting, where commonly used datasets suffer from significant limitations such as small size, limited temporal coverage, and sparse spatial distribution. These constraints severely impede the optimization and evaluation of TSF models, resulting in benchmarks that are not representative of real-world applications, such as operational weather forecasting. In this work, we introduce the WEATHER-5K dataset, a comprehensive collection of observational weather data that better reflects real-world scenarios. As a result, it enables better model training and a more accurate assessment of the real-world forecasting capabilities of TSF models, pushing them closer to in-situ applications. Through extensive benchmarking against operational Numerical Weather Prediction (NWP) models, we provide researchers with a clear assessment of the gap between academic TSF models and real-world weather forecasting applications. By analyzing performance across detailed weather variables, extreme weather event prediction, and model complexity, this benchmarking highlights the significant performance disparity between TSF and NWP models. Finally, we summarise the results as recommendations for users and highlight areas that need attention to facilitate further TSF research. The dataset and benchmark implementation are available at: https://github.com/taohan10200/WEATHER-5K.

Paper Structure

This paper contains 44 sections, 1 equation, 14 figures, 6 tables.

Figures (14)

  • Figure 1: A case study of winter storm forecasting using methods from two communities: an NWP model and a TSF model. It shows a large performance gap between them.
  • Figure 2: Flow diagram of the benchmark. a) Developing a download API to retrieve the raw ISD data and apply initial processing. b) Conducting rigorous quality control on selected stations to obtain a high-quality ISD subset. c) Using ERA5 to fill in missing data at the selected stations, which ensures 100% data completeness for training TSF models. d) Training and evaluating mainstream TSF models with the basic metrics and a newly proposed SEDI metric for extreme-value evaluation.
  • Figure 3: a) and b) Visualization of time-series data over a year. c) The geographical distribution of the weather stations in WEATHER-5K. d) The error between the observations and the ERA5 dataset. e) The daily 2m temperature at station 57516099999 in Chongqing City from 1 June to 15 September, where filled areas represent the variance from the daily mean.
  • Figure 4: a) Model performance vs complexity. b) The performance impact of input length.
  • Figure 5: Statistics on the Number of Weather Stations in Different Countries and Regions.
  • ...and 9 more figures
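
Step c) of the Figure 2 workflow, completing missing station data with ERA5, can be illustrated with a small sketch. It assumes the reanalysis values have already been matched to the station's location and hourly timestamps; how the benchmark performs that matching is not described here. The function name `fill_gaps` and the sample values are hypothetical.

```python
def fill_gaps(station_obs, reanalysis):
    """Replace missing station values (None) with time-aligned reanalysis values.

    Assumption: both sequences cover the same hourly timestamps, and the
    reanalysis series has already been mapped to the station's location.
    Returns the completed series and the number of values that were filled.
    """
    if len(station_obs) != len(reanalysis):
        raise ValueError("series must be aligned to the same timestamps")
    filled = [r if o is None else o for o, r in zip(station_obs, reanalysis)]
    n_filled = sum(o is None for o in station_obs)
    return filled, n_filled


# Hypothetical hourly 2m temperatures (degrees C) with two missing reports.
station = [12.1, None, 13.0, None, 14.2]
era5 = [11.8, 12.5, 12.9, 13.6, 14.0]
completed, n_filled = fill_gaps(station, era5)
print(completed, n_filled)
```

Observed values are always kept, and reanalysis values are used only where the station reported nothing, so the completed series stays as close to the in-situ record as possible while having no gaps.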