A Survey on Error-Bounded Lossy Compression for Scientific Datasets

Sheng Di; Jinyang Liu; Kai Zhao; Xin Liang; Robert Underwood; Zhaorui Zhang; Milan Shah; Yafan Huang; Jiajun Huang; Xiaodong Yu; Congrong Ren; Hanqi Guo; Grant Wilkins; Dingwen Tao; Jiannan Tian; Sian Jin; Zizhe Jian; Daoce Wang; MD Hasanur Rahman; Boyuan Zhang; Shihui Song; Jon C. Calhoun; Guanpeng Li; Kazutomo Yoshii; Khalid Ayed Alharthi; Franck Cappello

TL;DR

This survey addresses the challenge of storing and transferring voluminous scientific data through error-bounded lossy compression. It develops a six-model taxonomy, catalogs ten modular techniques, and reviews 46 compressors (general-purpose and domain-tailored), linking design choices to performance and reconstruction quality. The work emphasizes preservation of quantities of interest (QoI) and practical guidance for selecting compressors across HPC, climate, cosmology, and other domains, highlighting the benefits of co-design with domain experts. It also charts future research directions such as in-situ compression on accelerators, exploration of lossless methods, and adapting to data with varying autocorrelation.

Abstract

Error-bounded lossy compression has proven effective at significantly reducing the data storage and transfer burden while preserving the fidelity of the reconstructed data. Over the years, many error-bounded lossy compressors have been developed for a wide range of parallel and distributed use cases. They are designed with distinct compression models and principles, so each has particular pros and cons. In this paper we provide a comprehensive survey of emerging error-bounded lossy compression techniques. Our contributions are fourfold. (1) We summarize a novel taxonomy of lossy compression into 6 classic models. (2) We provide a comprehensive survey of 10 commonly used compression components/modules. (3) We summarize the pros and cons of 46 state-of-the-art lossy compressors and present how they are designed based on different compression techniques. (4) We discuss how customized compressors are designed for specific scientific applications and use cases. We believe this survey is useful to multiple communities, including scientific applications, high-performance computing, lossy compression, and big data.

Paper Structure (7 sections, 5 figures, 7 tables)

Introduction
Related Work
Compression Model Taxonomy
Modular Lossy Compression Techniques
General-Purpose Lossy Compressors for Scientific Data
Customized Compressors for Specific Applications or Use Cases
Conclusion and Future Work

Figures (5)

Figure 1: Scientific Lossy Compression Model Taxonomy
Figure 2: Compression Pipeline with Various Models. Each highlighted box represents a corresponding model. All the compression techniques shown here are detailed in Section \ref{sec:techniques}.
Figure 3: Usage Distribution of Compression Techniques in 46 Compressors (e.g., 25.7% of the compressors in Table \ref{tab:pipelines} use LE).
Figure 4: RRS Policy with Six Prediction Methods. SZ1, SZ2, SZ3, and SPERR are general-purpose compressors (see Table \ref{tab:pipelines}); Pastri and MDZ are customized compressors for chemistry applications (Sec. 6.1 & 6.2).
Figure 5: Illustration of Autoencoder (AE). The encoder and decoder are two models trained on datasets, and the latent vector represents the space in compressed format.
