Hierarchical Autoencoder-based Lossy Compression for Large-scale High-resolution Scientific Data

Hieu Le, Jian Tao

TL;DR

This work tackles the storage burden of petabyte-scale scientific data by introducing a hierarchical autoencoder with vector quantization tailored for lossy compression of high-resolution datasets. The model employs a two-level encoder with dual quantizers, EMA-driven codebook updates, and a straight-through estimator to enable end-to-end training. It reaches compression ratios of about $CR \approx 140$ on benchmark data sets and $CR \approx 200$ on CESM climate data while preserving reconstruction fidelity for subsequent scientific analysis. Preprocessing (standardization, masking, and overlapping block partitioning) and an error-bounded mechanism support robust performance on large-scale climate data such as iHESP/CESM. The training objective is $L = \lambda_{recon} \cdot \text{mask} \cdot l_{recon} + \lambda_q \cdot l_q$, where $l_{recon} = \|x - \hat{x}\|_2$ is the reconstruction error between the input $x$ and its reconstruction $\hat{x}$, $l_q = \|z_e - z_q\|_2$ is the quantization error between the encoder output $z_e$ and its quantized version $z_q$, and the weights are $\lambda_q = 0.25$ and $\lambda_{recon} = 2$. The approach demonstrates strong results on SDRBench and iHESP data, offering a practical, scalable solution for large-scale scientific data storage and retrieval with controllable distortion.
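To make the objective concrete, the following is a minimal pure-Python sketch of a nearest-codeword lookup followed by the loss $L$ with the stated weights. It is an illustration under assumptions, not the authors' implementation: the function names (`l2_norm`, `quantize`, `total_loss`) are hypothetical, the toy vectors and codebook are made up, and the mask is assumed to be applied elementwise to the residual $x - \hat{x}$ before the norm is taken. The EMA codebook update and straight-through gradient are not shown.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def quantize(z_e, codebook):
    """Return the codeword closest to z_e in Euclidean distance (generic VQ lookup)."""
    return min(codebook, key=lambda c: l2_norm([a - b for a, b in zip(z_e, c)]))


def total_loss(x, x_hat, mask, z_e, z_q, lambda_recon=2.0, lambda_q=0.25):
    """L = lambda_recon * mask * l_recon + lambda_q * l_q.

    Assumption: the mask zeroes out residual entries elementwise
    (e.g. invalid grid points) before the L2 norm is computed.
    """
    masked_residual = [m * (a - b) for a, b, m in zip(x, x_hat, mask)]
    l_recon = l2_norm(masked_residual)
    l_q = l2_norm([a - b for a, b in zip(z_e, z_q)])
    return lambda_recon * l_recon + lambda_q * l_q


# Toy data: the last input entry is masked out, so its large error is ignored.
x = [1.0, 2.0, 3.0, 0.0]
x_hat = [1.0, 2.0, 0.0, 5.0]
mask = [1, 1, 1, 0]

# Toy latent vector and a three-entry codebook (hypothetical values).
z_e = [0.5, 1.0]
codebook = [[0.0, 0.0], [0.5, 5.0], [1.0, 1.0]]
z_q = quantize(z_e, codebook)

loss = total_loss(x, x_hat, mask, z_e, z_q)
print(z_q, loss)
```

With these toy values the masked reconstruction error is 3 and the quantization error is 0.5, so the weighted sum is $2 \cdot 3 + 0.25 \cdot 0.5 = 6.125$.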

Abstract

Lossy compression has become an important technique to reduce data size in many domains. This type of compression is especially valuable for large-scale scientific data, which can reach several petabytes in size. Although autoencoder-based models have been successfully leveraged to compress images and videos, such neural networks have not gained wide attention in the scientific data domain. Our work presents a neural network that not only significantly compresses large-scale scientific data, but also maintains high reconstruction quality. The proposed model is tested with publicly available scientific benchmark data and applied to a large-scale high-resolution climate modeling data set. Our model achieves a compression ratio of 140 on several benchmark data sets without compromising the reconstruction quality. 2D simulation data from the High-Resolution Community Earth System Model (CESM) Version 1.3 over 500 years are also compressed with a compression ratio of 200, while the reconstruction error is negligible for scientific analysis.

Paper Structure (20 sections, 7 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Model Architecture
  • Figure 2: Components of a residual block
  • Figure 3: Compression performance on CESM 2D CLDHGH data