CRA5: Extreme Compression of ERA5 for Portable Global Climate and Weather Research via an Efficient Variational Transformer

Tao Han, Zhenghao Chen, Song Guo, Wanghan Xu, Lei Bai

TL;DR

This work introduces CRA5, an extremely compressed version of the ERA5 climate reanalysis dataset. It is produced with the VAEformer, a variational autoencoder transformer that uses variance inference and a Gaussian prior to enable efficient cross-entropy coding. Built from Atmospheric Circulation Transformer blocks with windowed attention, the VAEformer reaches a compression ratio of over 300× on ERA5 (226 TB to 0.7 TB) while preserving scientific utility, enabling AI-driven meteorological research on resource-constrained setups. The approach outperforms traditional and neural codecs in rate-distortion terms, and downstream weather forecasting models trained on CRA5 perform comparably to those trained on the full ERA5 dataset. The work also provides architectural details, evaluation metrics, visualizations, and a candid discussion of limitations and societal impacts, with code and data openly available at the project repository.
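The headline ratio follows directly from the two dataset sizes quoted above. As a quick check, the arithmetic can be sketched as follows; the helper name `compression_ratio` is introduced here for illustration only and does not come from the project repository.

```python
def compression_ratio(original_tb: float, compressed_tb: float) -> float:
    """Return how many times smaller the compressed dataset is."""
    if compressed_tb <= 0:
        raise ValueError("compressed size must be positive")
    return original_tb / compressed_tb


# Sizes as quoted in the summary: ERA5 at 226 TB, CRA5 at 0.7 TB.
ratio = compression_ratio(226.0, 0.7)
print(f"compression ratio: {ratio:.1f}x")
```

The result is a little above 320, consistent with the "over 300" figure stated in the text.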

Abstract

The advent of data-driven weather forecasting models, which learn from hundreds of terabytes (TB) of reanalysis data, has significantly advanced forecasting capabilities. However, the substantial costs of data storage and transmission present a major challenge for data providers and users, particularly resource-constrained researchers, and limit their ability to participate in AI-based meteorological research. To mitigate this issue, we introduce an efficient neural codec, the Variational Autoencoder Transformer (VAEformer), for extreme compression of climate data. It significantly reduces data storage cost and makes AI-based meteorological research portable for researchers. Our approach diverges from recent complex neural codecs by utilizing a low-complexity Auto-Encoder transformer. This encoder produces a quantized latent representation through variance inference, which reparameterizes the latent space as a Gaussian distribution. This method improves the estimation of distributions for cross-entropy coding. Extensive experiments demonstrate that our VAEformer outperforms existing state-of-the-art compression methods on climate data. By applying our VAEformer, we compressed the most popular ERA5 climate dataset (226 TB) into a new dataset, CRA5 (0.7 TB). This translates to a compression ratio of over 300 while retaining the dataset's utility for accurate scientific analysis. Further, downstream experiments show that global weather forecasting models trained on the compact CRA5 dataset achieve forecasting accuracy comparable to that of models trained on the original dataset. Code, the CRA5 dataset, and the pre-trained model are available at https://github.com/taohan10200/CRA5.
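To make the link between a Gaussian latent model and coding cost concrete, the sketch below estimates the bit cost of integer-quantized latents under per-element Gaussians. It uses the quantized-Gaussian likelihood common in learned image compression, which is an assumption here and not a confirmed description of the VAEformer's own entropy model. The function names and the toy values are hypothetical.

```python
import math


def gaussian_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of a normal distribution N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def estimated_bits(latents, means, sigmas, eps: float = 1e-9) -> float:
    """Estimated code length in bits for rounded latents.

    Each latent is rounded to the nearest integer, and its probability is
    the Gaussian mass on the unit-width bin centred on that integer.
    """
    total = 0.0
    for y, mu, sigma in zip(latents, means, sigmas):
        q = round(y)
        p = gaussian_cdf(q + 0.5, mu, sigma) - gaussian_cdf(q - 0.5, mu, sigma)
        total += -math.log2(max(p, eps))  # clamp to avoid log(0)
    return total


# Toy values (hypothetical): the predicted means match the rounded latents.
latents = [0.2, -1.7, 3.1]
means = [0.0, -2.0, 3.0]
narrow = estimated_bits(latents, means, [0.3, 0.3, 0.3])
wide = estimated_bits(latents, means, [3.0, 3.0, 3.0])
print(f"confident prior: {narrow:.2f} bits, vague prior: {wide:.2f} bits")
```

When the predicted distribution concentrates on the value that actually occurs, the estimated code length drops, which is why a better distribution estimate translates into a lower bitrate.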

Paper Structure (9 sections, 4 equations, 8 figures, 6 tables)

Figures (8)

  • Figure S1: Rate-Distortion (RD) performance on the test data, "ERA5: 128$\times$256, 2022 year", comparing NIC methods (bmsj2018, cheng2020, mbt2018, STF2022, ELIC2022, TCM2023, and VAEformer (ours)) with the traditional image codec JPEG2000. Distortion is measured by mean squared error (MSE).
  • Figure S2: Visualization samples of t2m on ERA5 and the compressed CRA5. From the left to the right column: ERA5, CRA5, and their mean absolute error map.
  • Figure S3: Visualization samples of t2m on ERA5 and the compressed CRA5. From the left to the right column: ERA5, CRA5, and their mean absolute error map.
  • Figure S4: Visualization samples of z500 on ERA5 and the compressed CRA5. From the left to the right column: ERA5, CRA5, and their mean absolute error map.
  • Figure S5: Visualization samples of msl on ERA5 and the compressed CRA5. From the left to the right column: ERA5, CRA5, and their mean absolute error map.
  • ...and 3 more figures