Generative Data Assimilation of Sparse Weather Station Observations at Kilometer Scales

Peter Manshausen, Yair Cohen, Peter Harrington, Jaideep Pathak, Mike Pritchard, Piyush Garg, Morteza Mardani, Karthik Kashinath, Simon Byrne, Noah Brenowitz

TL;DR

This work demonstrates the first km-scale score-based data assimilation (SDA) of sparse weather-station observations by training an unconditional diffusion model on HRRR analysis and guiding state reconstruction with Integrated Surface Database (ISD) station data. The approach yields physically plausible wind and precipitation fields at 3 km resolution and, in experiments with pseudo- and real observations, shows skill that matches or surpasses the HRRR baseline for wind while offering fast, flexible, zero-shot integration of new data streams. The study provides evidence of learned cross-variable physics and underscores the potential for ensemble SDA, though it also acknowledges that the ensembles are under-dispersed and need calibration. Overall, the method offers a simple, scalable pathway to real-time, km-scale regional reanalyses without retraining, with room to incorporate additional data types and a time dimension.

Abstract

Data assimilation of observational data into full atmospheric states is essential for weather forecast model initialization. Recently, methods for deep generative data assimilation have been proposed which allow for using new input data without retraining the model. They could also dramatically accelerate the costly data assimilation process used in operational regional weather models. Here, in a central US testbed, we demonstrate the viability of score-based data assimilation in the context of realistically complex km-scale weather. We train an unconditional diffusion model to generate snapshots of a state-of-the-art km-scale analysis product, the High-Resolution Rapid Refresh. Then, using score-based data assimilation to incorporate sparse weather station data, the model produces maps of precipitation and surface winds. The generated fields display physically plausible structures, such as gust fronts, and sensitivity tests confirm learnt physics through multivariate relationships. Preliminary skill analysis shows the approach already outperforms a naive baseline of the High-Resolution Rapid Refresh system itself. By incorporating observations from 40 weather stations, 10% lower RMSEs on left-out stations are attained. Despite some lingering imperfections such as insufficiently dispersed ensemble DA estimates, we find the results overall an encouraging proof of concept, and the first at km-scale. It is a ripe time to explore extensions that combine increasingly ambitious regional state generators with an increasing set of in situ, ground-based, and satellite remote sensing data streams.

Paper Structure

This paper contains 20 sections, 20 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Denoiser training and data assimilation with SDA. a) During the training of the denoiser, noise is added to the training data at different levels, parameterized by diffusion-time $\tau\in [0,1]$. The training objective for the denoiser $D$ is to reconstruct the training data, given the noisy state and the time $\tau$. b) Data assimilation then uses $D$ to go from a noisy state $\mathbf{x}_\tau$ to a possible denoised state $\hat{\mathbf{x}}$. The observation operator $\mathcal{H}$ then maps $\hat{\mathbf{x}}$ to the observations it would give rise to, which we compare to the actual observations $\mathbf{y}$ (e.g. weather station data). The difference of the two is used to calculate the score $\nabla_{\mathbf{x}} \log p(\mathbf{y}|\mathbf{x}_\tau)$, giving the direction in which the noisy state $\mathbf{x}_\tau$ is updated. This cycle repeats for a number of steps, with diffusion-time $\tau$ running from 1 to 0, until $\mathbf{x}$ is denoised, taking into account the observations and the model's learned prior (the HRRR analysis).
  • Figure 2: Assimilating increasingly sparse and noisy data. Columns show the different variables 10u, 10v, and tp for different study cases. In row one, we show HRRR data of 2017-05-28 03:00 UTC, as well as the station data (triangles). Rows two and three show this data subsampled to 1.6% and 0.3%, respectively, in a regular grid (pentagons), plotted over the assimilated high-resolution state. Row four shows the HRRR data subsampled to the locations of our ISD weather stations, as well as the assimilated state. Row five shows again the observations from the stations (triangles), as well as the assimilated state. For a visualization of the HRRR winds as vector arrows, see Fig. 3.
  • Figure 3: Generating a left-out variable from other variables. We feed the model the HRRR 10u and precipitation, leaving out the 10v channel (top left). The model generates a reasonable 10v (top right). The bottom row shows HRRR tp overlaid with a quiver plot of 10u and 10v from HRRR (bottom left) and 10u and 10v from the model output. Note the wind arrows pointing away from the precipitation in both cases.
  • Figure 4: SDA can produce stochastic ensembles of assimilated states. We assimilate the same station data as in Fig. 2, but now generate a 20-member ensemble of states. We show the first member in the second row, the ensemble mean in the third, and the standard deviation in the fourth.
  • Figure 5: Testing the dependence of assimilation on station density. Evaluating using data from the whole year of 2017, we vary the number of stations used for guiding the inference by the SDA framework. The resulting states are evaluated on the held-out stations, giving the RMSEs in each of the variables (solid lines). We also evaluate the RMSE of the HRRR analysis on the same held-out stations (dotted lines).
  • ...and 4 more figures
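
The assimilation cycle in the Figure 1 caption (denoise, apply the observation operator, compare with the observations, update the noisy state along the score) can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a 1-D Gaussian prior whose optimal denoiser is known in closed form and stands in for the trained network, a hypothetical operator `H` that samples every 8th grid point as a "station", a noise scale `sigma` in place of the diffusion-time $\tau$, and a plain Euler sampler. Because everything is Gaussian, the likelihood covariance used here is exact for this toy; a practical SDA scheme has to approximate it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D state on n grid points with a smooth Gaussian prior N(0, C).
# (Assumption: this prior replaces the learned HRRR prior of the paper.)
n, length_scale = 64, 6.0
idx = np.arange(n)
C = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / length_scale) ** 2)

# Hypothetical sparse observation operator: one "station" every 8 points.
stations = idx[4::8]
H = np.zeros((stations.size, n))
H[np.arange(stations.size), stations] = 1.0

# Synthetic truth drawn from the prior, observed with noise sigma_y.
sigma_y = 0.05
truth = np.linalg.cholesky(C + 1e-6 * np.eye(n)) @ rng.standard_normal(n)
y = H @ truth + sigma_y * rng.standard_normal(stations.size)


def denoiser_matrix(sigma):
    """Optimal linear denoiser A for a Gaussian prior: D(x, sigma) = A @ x."""
    return np.linalg.solve(C + sigma**2 * np.eye(n), C)


def sample(obs=None, steps=200, sigma_max=10.0, sigma_min=0.01, seed=1):
    """Euler sampler from high to low noise; guided by obs if given."""
    noise_rng = np.random.default_rng(seed)
    sigmas = np.geomspace(sigma_max, sigma_min, steps)
    x = sigma_max * noise_rng.standard_normal(n)  # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        A = denoiser_matrix(sigma)
        x_hat = A @ x                              # denoised estimate
        score = (x_hat - x) / sigma**2             # prior score
        if obs is not None:
            # Likelihood score: misfit between the observations and H(x_hat),
            # propagated back through the (linear) denoiser.
            S = sigma_y**2 * np.eye(len(obs)) + sigma**2 * H @ A @ H.T
            score = score + A.T @ H.T @ np.linalg.solve(S, obs - H @ x_hat)
        x = x + (sigma - sigma_next) * sigma * score  # step toward less noise
    return denoiser_matrix(sigma_min) @ x          # final denoising pass


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


guided = sample(obs=y)   # assimilates the station data
free = sample(obs=None)  # unconditional draw from the prior, same noise seed

rmse_guided = rmse(H @ guided, y)
rmse_free = rmse(H @ free, y)
print(f"station RMSE, guided: {rmse_guided:.3f}  unguided: {rmse_free:.3f}")
```

Calling `sample(obs=y, seed=k)` for several values of `k` gives different states that all honor the same observations, a small-scale analogue of the ensemble shown in Figure 4. The same `rmse` helper can be applied at grid points that `H` does not sample, mirroring the held-out station evaluation of Figure 5.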