Compressing high-resolution data through latent representation encoding for downscaling large-scale AI weather forecast model

Qian Liu, Bing Gong, Xiaoran Zhuang, Xiaohui Zhong, Zhiming Kang, Hao Li

TL;DR

This paper proposes a variational autoencoder (VAE) framework for compressing high-resolution datasets, specifically the High Resolution China Meteorological Administration Land Data Assimilation System (HRCLDAS) at 1 km spatial resolution, and uses it to reduce the storage size of 3 years of HRCLDAS data from 8.61 TB to 204 GB.

Abstract

The rapid advancement of artificial intelligence (AI) in weather research has been driven by the ability to learn from large, high-dimensional datasets. However, this progress also poses significant challenges, particularly the substantial cost of processing extensive data and the limits of available computational resources. Inspired by the Neural Image Compression (NIC) task in computer vision, this study seeks to compress weather data to address these challenges and improve the efficiency of downstream applications. We propose a variational autoencoder (VAE) framework tailored for compressing high-resolution datasets, in particular the High Resolution China Meteorological Administration Land Data Assimilation System (HRCLDAS), which has a spatial resolution of 1 km. Our framework reduced the storage size of 3 years of HRCLDAS data from 8.61 TB to 204 GB while preserving essential information. In addition, we demonstrated the utility of the compressed data through a downscaling task, in which the model trained on the compressed dataset achieved accuracy comparable to that of the model trained on the original data. These results highlight the effectiveness and potential of the compressed data for future weather research.
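To put the storage figures in the abstract into perspective, the short sketch below computes the implied compression ratio. Only the two sizes (8.61 TB and 204 GB) come from the abstract. The function name and the two unit conventions (1 TB = 1000 GB or 1024 GB) are assumptions made for illustration, since the abstract does not say which convention is used.

```python
# Storage figures quoted in the abstract.
ORIGINAL_TB = 8.61
COMPRESSED_GB = 204.0


def compression_ratio(original_tb, compressed_gb, gb_per_tb):
    """Return the original size divided by the compressed size (hypothetical helper)."""
    return original_tb * gb_per_tb / compressed_gb


# The abstract does not state the unit convention, so both are shown.
ratio_decimal = compression_ratio(ORIGINAL_TB, COMPRESSED_GB, 1000)  # assumes 1 TB = 1000 GB
ratio_binary = compression_ratio(ORIGINAL_TB, COMPRESSED_GB, 1024)   # assumes 1 TB = 1024 GB

# Fraction of storage saved under the decimal convention.
saved_fraction = 1.0 - 1.0 / ratio_decimal

print(f"ratio (decimal units): {ratio_decimal:.1f}x")
print(f"ratio (binary units):  {ratio_binary:.1f}x")
print(f"storage saved:         {saved_fraction:.1%}")
```

Under either convention the ratio is a little over 40x, which corresponds to a storage saving of roughly 97 to 98 percent.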

Paper Structure

This paper contains 11 sections, 8 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Illustration of the proposed framework.
  • Figure 2: Example fields reconstructed by the VAE (fine-tune) for T$_{2M}$, U$_{10M}$ and V$_{10M}$ on 9 September 2021, 00 UTC. (a) T$_{2M}$ from HRCLDAS, (b) the T$_{2M}$ field reconstructed by the VAE (fine-tune), (c) differences between HRCLDAS T$_{2M}$ and reconstructed T$_{2M}$, (d) U$_{10M}$ from HRCLDAS, (e) the U$_{10M}$ field reconstructed by the VAE (fine-tune), (f) differences between HRCLDAS U$_{10M}$ and reconstructed U$_{10M}$, (g) V$_{10M}$ from HRCLDAS, (h) the V$_{10M}$ field reconstructed by the VAE (fine-tune), (i) differences between HRCLDAS V$_{10M}$ and reconstructed V$_{10M}$.
  • Figure 3: Log(density) plots for (a) T$_{2M}$, (b) U$_{10M}$, and (c) V$_{10M}$, comparing the VAE method with the resize method and the HRCLDAS data.
  • Figure 4: Box-and-whisker plots comparing the baseline and the machine learning methods in terms of MSE for (a) T$_{2M}$, (b) U$_{10M}$, and (c) V$_{10M}$ at lead times of 1 h to 18 h. The solid horizontal lines indicate the minimum and maximum MSE, excluding outliers; the box bounds the interquartile range from the 25th to the 75th percentile, with the 50th percentile marked. Red indicates the baseline model "resize"; green indicates the U-Net trained on the original HRCLDAS data; blue indicates the U-Net trained on the compact HRCLDAS data generated by the VAE.
  • Figure 5: Box-and-whisker plots comparing the baseline and the machine learning methods in terms of SSIM for (a) T$_{2M}$, (b) U$_{10M}$, and (c) V$_{10M}$ at lead times of 1 h to 18 h. The solid horizontal lines indicate the minimum and maximum SSIM, excluding outliers; the box bounds the interquartile range from the 25th to the 75th percentile, with the 50th percentile marked. Red indicates the baseline model "resize"; green indicates the U-Net trained on the original HRCLDAS data; blue indicates the U-Net trained on the compact HRCLDAS data generated by the VAE.
  • ...and 5 more figures