Efficient Compression of Sparse Accelerator Data Using Implicit Neural Representations and Importance Sampling

Xihaier Luo, Samuel Lurvey, Yi Huang, Yihui Ren, Jin Huang, Byung-Jun Yoon

TL;DR

The paper tackles the challenge of compressing extremely sparse, high-dimensional accelerator data by leveraging implicit neural representations (INRs) to learn continuous data representations and applying an importance sampling strategy to accelerate training. It compares three INR variants—SIREN, FFNet, and WIRE—with a baseline MLP, showing that SIREN yields the best continuous reconstruction and that INR-based compression can compete with traditional lossy compressors like MGARD, SZ, and ZFP, often with speed-ups. An additional contribution is the proposal and evaluation of sampling strategies, notably Importance Sampling, which prioritizes non-zero, information-rich data points to reduce training cost without sacrificing accuracy; entropy-based sampling offers an alternative with different trade-offs. The results demonstrate the practicality of INR-based compression for sparse scientific data, offering a scalable approach for real-time data reduction in detectors such as the sPHENIX TPC and enabling efficient storage and downstream analysis.

Abstract

High-energy, large-scale particle colliders in nuclear and high-energy physics generate data at extraordinary rates, reaching up to $1$ terabyte and several petabytes per second, respectively. The development of real-time, high-throughput data compression algorithms capable of reducing this data to manageable sizes for permanent storage is of paramount importance. A unique characteristic of the tracking detector data is the extreme sparsity of particle trajectories in space, with an occupancy rate ranging from approximately $10^{-6}$ to $10\%$. Furthermore, for downstream tasks, a continuous representation of this data is often more useful than a voxel-based, discrete representation due to the inherently continuous nature of the signals involved. To address these challenges, we propose a novel approach using implicit neural representations for data learning and compression. We also introduce an importance sampling technique to accelerate the network training process. Our method is competitive with traditional compression algorithms, such as MGARD, SZ, and ZFP, while offering significant speed-ups and maintaining negligible accuracy loss through our importance sampling strategy.
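The idea of prioritizing non-zero voxels during training can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the weighting rule (probability proportional to signal magnitude plus a small floor), the `floor` parameter, the function name `importance_sample`, and the synthetic volume are all assumptions made for illustration.

```python
import numpy as np

def importance_sample(volume, n_samples, floor=1e-3, rng=None):
    """Pick voxel coordinates with probability biased toward non-zero voxels.

    Assumption: weight = |value| + floor * max|value|. The floor keeps a small
    chance of drawing empty voxels so the network also sees the background.
    """
    rng = np.random.default_rng() if rng is None else rng
    magnitude = np.abs(volume).ravel().astype(np.float64)
    peak = magnitude.max()
    if peak > 0:
        weights = magnitude + floor * peak
    else:
        weights = np.ones_like(magnitude)  # all-zero volume: fall back to uniform
    probs = weights / weights.sum()
    flat_idx = rng.choice(magnitude.size, size=n_samples, replace=False, p=probs)
    coords = np.stack(np.unravel_index(flat_idx, volume.shape), axis=-1)
    values = volume.ravel()[flat_idx]
    return coords, values

# Synthetic sparse volume (hypothetical data, roughly 1% occupancy).
rng = np.random.default_rng(0)
shape = (24, 32, 8)
volume = np.zeros(shape)
hit_idx = rng.choice(volume.size, size=60, replace=False)
volume.ravel()[hit_idx] = rng.uniform(0.5, 1.0, size=60)

occupancy = np.count_nonzero(volume) / volume.size
coords, values = importance_sample(volume, n_samples=600, rng=rng)
sampled_occupancy = np.count_nonzero(values) / values.size
print(f"occupancy in volume: {occupancy:.4f}, in sample: {sampled_occupancy:.4f}")
```

Each training step would then fit the network only on `coords` and `values` rather than on every voxel, which is where the reduction in training cost comes from under this scheme.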

Paper Structure

This paper contains 18 sections, 9 equations, 7 figures.

Figures (7)

  • Figure 1: Illustration of the working principle of the time projection chamber (TPC) of the sPHENIX experiment. For simplicity, a single charged particle is visualized as it is produced at the collision point and traverses the TPC, leaving ion-electron pairs along its trajectory. These ionization electrons drift along an electric field to the end plate for amplification and readout. During an experiment, thousands of particles can be produced in a single collision, and tracks from multiple collisions can pile up on each other in the TPC data.
  • Figure 2: Qualitative results of continuous reconstruction with super-resolution scales of $\times 4$ and $\times 8$. The $\times 4$ super-resolution is trained on $(96, 125, 16)$ and the $\times 8$ on $(96, 125, 8)$. Both are evaluated on the full resolution $(192, 249, 16)$.
  • Figure 3: Panel A. MSE vs. compression ratio for conventional methods (MGARD, SZ, and ZFP) and INR approaches (SIREN, WIRE, and FFNet). Panel B. MSE vs. sampling ratio for different sampling methods based on the SIREN algorithm. Panel C. Time vs. sampling ratio for different sampling methods.
  • Figure 4: Qualitative results of continuous reconstruction at a super-resolution scale of $\times 1$. All INR models were trained on data with dimensions $192 \times 249 \times 16$ and evaluated on datasets of the same size.
  • Figure 5: Qualitative results of continuous reconstruction at a super-resolution scale of $\times 4$. All INR models were trained on data with dimensions $96 \times 125 \times 16$ and evaluated on datasets with dimensions $192 \times 249 \times 16$.
  • ...and 2 more figures
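
The grid sizes quoted in the captions of Figures 2 and 5 fit a simple reading, sketched below: taking every second sample along the first two axes of the full $(192, 249, 16)$ grid yields $(96, 125, 16)$, and also striding the third axis yields $(96, 125, 8)$, for voxel-count ratios of about 4 and 8. Stride-based subsampling and coordinates normalized to $[-1, 1]$ are assumptions for illustration; the source does not state how the low-resolution training grids were produced. All function names are hypothetical.

```python
import numpy as np

FULL_SHAPE = (192, 249, 16)  # full resolution quoted in the figure captions

def normalized_axes(shape):
    """One coordinate vector per axis, spanning [-1, 1] (assumed convention)."""
    return [np.linspace(-1.0, 1.0, n) for n in shape]

def subsample_axes(axes, strides):
    """Keep every `stride`-th coordinate along each axis."""
    return [axis[::stride] for axis, stride in zip(axes, strides)]

def coords_from_axes(axes):
    """Flatten the grid into an (n_points, n_dims) array of query coordinates."""
    mesh = np.meshgrid(*axes, indexing="ij")
    return np.stack(mesh, axis=-1).reshape(-1, len(axes))

full_axes = normalized_axes(FULL_SHAPE)
x4_axes = subsample_axes(full_axes, (2, 2, 1))  # halve the first two axes
x8_axes = subsample_axes(full_axes, (2, 2, 2))  # halve all three axes

x4_shape = tuple(len(axis) for axis in x4_axes)
x8_shape = tuple(len(axis) for axis in x8_axes)

full_coords = coords_from_axes(full_axes)
x4_coords = coords_from_axes(x4_axes)
x8_coords = coords_from_axes(x8_axes)

ratio_x4 = len(full_coords) / len(x4_coords)
ratio_x8 = len(full_coords) / len(x8_coords)
print(x4_shape, x8_shape, ratio_x4, ratio_x8)
```

Because an INR maps a coordinate to a value, a model fitted on `x4_coords` or `x8_coords` can be queried at every row of `full_coords`, which is what makes evaluation at the full resolution possible without a separate upsampling step.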