An Error-Bounded Lossy Compression Method with Bit-Adaptive Quantization for Particle Data

Congrong Ren, Sheng Di, Longtao Zhang, Kai Zhao, Hanqi Guo

TL;DR

This work tackles the challenge of reducing storage for trillion-point particle datasets under strict pointwise accuracy by introducing an error-bounded lossy compression framework that preserves all particles. It combines k-d tree partitioning with adaptive-bit-depth "bit boxes" that encode particle positions within a user-defined error bound $\epsilon$, and organizes the encoded data into sequences that are reordered and then compressed with Huffman coding and ZSTD. Key contributions include the bit-box construction with per-dimension bit counts $m_d$ and lengths $l_d$, a box-intersection query for managing overlapping bit boxes, and a sequence-based data layout with targeted reordering. The method yields better rate-distortion performance than SZ and MDZ on cosmology, fluid dynamics, and fusion-plasma datasets, supporting storage, visualization, and analysis of large-scale particle simulations. Progressive compression and dynamic updates are identified as future work.

Abstract

This paper presents error-bounded lossy compression tailored for particle datasets from diverse scientific applications in cosmology, fluid dynamics, and fusion energy sciences. As today's high-performance computing capabilities advance, these datasets often reach trillions of points, posing significant visualization, analysis, and storage challenges. While error-bounded lossy compression makes it possible to represent floating-point values with strict pointwise accuracy guarantees, the lack of correlations in particle data's storage ordering often limits the compression ratio. Inspired by quantization-encoding schemes in SZ lossy compressors, we dynamically determine the number of bits to encode particles of the dataset to increase the compression ratio. Specifically, we utilize a k-d tree to partition particles into subregions and generate "bit boxes" centered at particles for each subregion to encode their positions. These bit boxes ensure error control while reducing the bit count used for compression. We comprehensively evaluate our method against state-of-the-art compressors on cosmology, fluid dynamics, and fusion plasma datasets.
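
To make the idea of encoding positions relative to a box center under a pointwise bound more concrete, the following is a minimal sketch, not the paper's implementation. It quantizes each coordinate offset from a chosen center into bins of width $2\epsilon$, which keeps the reconstruction error within $\epsilon$ (up to floating-point rounding). The function names `quantize_offsets` and `dequantize` are hypothetical, and the per-dimension bit count computed here is an illustrative assumption; it is not the paper's definition of $m_d$ or $l_d$.

```python
def quantize_offsets(center, points, eps):
    """Quantize each point relative to `center` using bins of width 2*eps.

    Returns (codes, bits). bits[d] is the number of bits needed to index
    every code in the symmetric range [-max_abs, max_abs] for dimension d.
    This bit-count rule is an assumption made for illustration only.
    """
    dims = len(center)
    codes = [
        [round((p[d] - center[d]) / (2.0 * eps)) for d in range(dims)]
        for p in points
    ]
    bits = []
    for d in range(dims):
        max_abs = max((abs(c[d]) for c in codes), default=0)
        # 2*max_abs + 1 distinct codes fit in (2*max_abs).bit_length() bits.
        bits.append(max(1, (2 * max_abs).bit_length()))
    return codes, bits


def dequantize(center, codes, eps):
    """Reconstruct positions from integer codes; error per coordinate is at most eps."""
    dims = len(center)
    return [[center[d] + 2.0 * eps * c[d] for d in range(dims)] for c in codes]
```

Under these assumptions, a tightly clustered subregion produces small codes and therefore few bits per dimension, while a spread-out subregion needs more. That is the intuition behind choosing the bit count adaptively per box rather than globally.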

Paper Structure (14 sections, 5 equations, 11 figures, 2 tables, 1 algorithm)

Figures (11)

  • Figure 1: Linear-scaling quantization. The quantization code is $2^{m-1}-2$ for the value in this example. Image reproduced from Figure 2 in Tao et al. [tao2017significantly].
  • Figure 2: Workflow of our compression and decompression algorithms for particle position data.
  • Figure 3: Illustration of our bit box reduction algorithm for particle position compression, using a 2D example. (a) A set of particles. (b) Particles are divided into groups via k-d tree splitting at the median. The red lines represent the splitting (hyper)planes, while the red dots indicate the medians. Numbers in red denote the order of splitting. The splitting process terminates when each subregion contains no more than $r$ particles, with $r$ being an integer ranging from 6 to 11 in this example. (c) We start by selecting centers for bit boxes. Initially, axis-aligned bounding boxes (AABBs), delineated by red dashed lines, are created for the subregions. The particle (solid red dot) closest to the center of the AABB (hollow red dot) is chosen as the center of the corresponding bit box for the subregion. (d) We construct bit boxes for subregions centered at the solid red dots indicated in (c), with lengths calculated based on Equations \ref{eq:num_bits} and \ref{eq:box_length}. Particles that fall within the overlap of bit boxes are highlighted in blue. (e) The bit box with the smallest sum of lengths across all dimensions (shown as the solid red box) is selected for elimination. The center of the bit box is stored losslessly. In contrast, all other particles inside the bit box, including those initially belonging to other subregions (e.g., the blue dot), are quantized using the numbers of bits determined by Equation \ref{eq:num_bits}. All intersecting bit boxes are highlighted by thick lines. (f) All intersecting bit boxes are updated, as removing particles from their respective subregions may reduce the sizes (and potentially change the centers) of these bit boxes.
  • Figure 4: The total number of bits across all dimensions ($\sum_d m_d$) for quantization codes of bit boxes for the NYX dataset with $\xi=0.005$ and $\tilde{r}=0.05\%$. $\sum_d m_d$ is not monotonically increasing, and the number of removed bit boxes (i.e., 1,993) is smaller than the total number of bit boxes (i.e., 2,000). See the explanation in the last paragraph of Section \ref{sec:bit_box}.
  • Figure 5: Data layout of sequences
  • ...and 6 more figures
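
Steps (b) and (c) of the Figure 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code. The helper names `kd_partition` and `pick_center` are hypothetical. The caption specifies only that splits occur at the median, so cycling through the axes by depth is an assumption. Overlap handling and bit-box elimination (steps (d) to (f)) are not covered.

```python
def kd_partition(points, r, depth=0):
    """Split points at the median until every leaf holds at most r points.

    The split axis cycles with depth (an assumption; the caption only
    states that splits occur at the median). Returns a list of leaf groups.
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    if len(points) <= r:
        return [points]
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2  # median index; both halves are non-empty
    return (kd_partition(ordered[:mid], r, depth + 1)
            + kd_partition(ordered[mid:], r, depth + 1))


def pick_center(group):
    """Return the particle closest to the center of the group's AABB."""
    dims = len(group[0])
    lo = [min(p[d] for p in group) for d in range(dims)]
    hi = [max(p[d] for p in group) for d in range(dims)]
    aabb_center = [(lo[d] + hi[d]) / 2.0 for d in range(dims)]
    return min(
        group,
        key=lambda p: sum((p[d] - aabb_center[d]) ** 2 for d in range(dims)),
    )
```

Calling `kd_partition(points, r)` on a list of coordinate tuples and then `pick_center` on each returned group yields one candidate bit-box center per subregion. Choosing an actual particle rather than the geometric AABB center matches the caption's statement that the center of a bit box is itself a particle that is stored losslessly.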