Lossy Compression of Scientific Data: Applications Constrains and Requirements

Franck Cappello, Allison Baker, Ebru Bozdağ, Martin Burtscher, Kyle Chard, Sheng Di, Paul Christopher O'Grady, Peng Jiang, Shaomeng Li, Erik Lindahl, Peter Lindstrom, Magnus Lundborg, Kai Zhao, Xin Liang, Masaru Nagaso, Kento Sato, Amarjit Singh, Seung Woo Son, Dingwen Tao, Jiannan Tian, Robert Underwood, Kazutomo Yoshii, Danylo Lykov, Yuri Alexeev, Kyle Gerard Felker

TL;DR

This paper surveys the escalating data volumes in scientific computing and argues for lossy compression as a viable data-reduction strategy that preserves quantities of interest. It catalogs nine diverse application domains and reviews eight leading compression technologies, linking application-specific QoIs to performance and fidelity requirements. The study highlights gaps in topology, derived QoIs, and distributional preservation, and underscores needs for automated configuration, cross-hardware portability, and long-term data format stability. By outlining concrete demands and existing capabilities, the work aims to guide future research toward scalable, QoI-aware lossy compression in HPC contexts.

Abstract

Increasing data volumes from scientific simulations and instruments (supercomputers, accelerators, telescopes) often exceed network, storage, and analysis capabilities. The scientific community's response to this challenge is scientific data reduction. Reduction can take many forms, such as triggering, sampling, filtering, quantization, and dimensionality reduction. This report focuses on a specific technique: lossy compression. Lossy compression retains all data points and reduces their size by exploiting correlations in the data and accepting a controlled loss of accuracy. Quality constraints, especially on quantities of interest, are crucial for preserving scientific discoveries. User requirements also include compression ratio and speed. While many papers have been published on lossy compression techniques and the community shares reference datasets, detailed specifications of application needs that could guide lossy compression researchers and developers are lacking. This report fills this gap by reporting on the requirements and constraints of nine scientific applications covering a large spectrum of domains (climate, combustion, cosmology, fusion, light sources, molecular dynamics, quantum circuit simulation, seismology, and system logs). The report also details key lossy compression technologies (SZ, ZFP, MGARD, LC, SPERR, DCTZ, TEZip, LibPressio), discussing their history, principles, error control, hardware support, features, and impact. By presenting both application needs and compression technologies, the report aims to inspire new research to fill existing gaps.

Paper Structure

This paper contains 142 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Diagram of the FWI workflow. In an iteration loop, there are two I/O peak timings indicated by the red arrows. The first peak occurs when all the simultaneous forward simulations store their snapshots on physical storage. The second peak occurs when the simultaneous adjoint simulations read those snapshots to recreate the wave fields.
  • Figure 2: Illustration of the big data issue in preserving forward-propagation wave snapshots during the execution of reverse time migration (RTM).
  • Figure 3: LC's process of chaining (i.e., pipelining) $n$ data transformations to form a custom compression algorithm and the inverses of those transformations to form the matching decompression algorithm (the components are lossless whereas the preprocessors include guaranteed-error-bounded lossy quantizers).
  • Figure 4: TEZip (Time Evolutionary Zip) framework
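
The chaining idea in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not LC's API or implementation: every function name here (`quantize`, `delta_encode`, `pack`, `make_pipeline`) is hypothetical, and delta encoding plus `zlib` are stand-ins chosen only to show the pattern. A lossy quantizer with an absolute error bound acts as the preprocessor, lossless stages follow it, and decompression applies the inverses in reverse order.

```python
import math
import struct
import zlib


def quantize(values, error_bound):
    """Lossy preprocessor (hypothetical): snap each value to a bin of width 2*error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(bins, error_bound):
    """Inverse of quantize: return each bin's center, within error_bound of the original
    value (up to floating-point rounding)."""
    step = 2.0 * error_bound
    return [b * step for b in bins]


def delta_encode(ints):
    """Lossless stage: store differences between consecutive integers."""
    out, prev = [], 0
    for x in ints:
        out.append(x - prev)
        prev = x
    return out


def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(ints):
    """Lossless stage: serialize as 64-bit little-endian integers, then deflate with zlib."""
    return zlib.compress(struct.pack(f"<{len(ints)}q", *ints))


def unpack(blob):
    """Inverse of pack."""
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}q", raw))


def make_pipeline(error_bound):
    """Chain (forward, inverse) stage pairs into matching compress/decompress functions."""
    stages = [
        (lambda v: quantize(v, error_bound), lambda b: dequantize(b, error_bound)),
        (delta_encode, delta_decode),
        (pack, unpack),
    ]

    def compress(data):
        for forward, _ in stages:
            data = forward(data)
        return data

    def decompress(blob):
        # Apply the inverses in the opposite order of the forward stages.
        for _, inverse in reversed(stages):
            blob = inverse(blob)
        return blob

    return compress, decompress


# Usage on a smooth synthetic signal (illustrative data, not from the paper).
compress, decompress = make_pipeline(error_bound=1e-3)
signal = [math.sin(0.01 * i) for i in range(1000)]
blob = compress(signal)
restored = decompress(blob)
max_err = max(abs(a - b) for a, b in zip(signal, restored))
```

Keeping each stage as a forward/inverse pair means the decompressor is derived from the same stage list as the compressor, so the two cannot drift apart. A production compressor would additionally verify the bound after reconstruction, since floating-point rounding can push an error marginally past the nominal limit.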