BOA Constrictor: A Mamba-based lossless compressor for High Energy Physics data

Akshat Gupta, Caterina Doglioni, Thomas Joseph Elliott

TL;DR

This work presents the Bytewise Online Autoregressive (BOA) Constrictor, a novel, streaming-capable lossless compressor built upon the Mamba architecture, and concludes that while this Mamba-based approach is a highly promising proof-of-principle, significant future work on performance optimisation and hardware portability is required to develop it into a production-ready tool for the HEP community.

Abstract

The petabyte-scale data generated annually by High Energy Physics (HEP) experiments, such as those at the Large Hadron Collider, present a significant storage challenge. Whilst traditional algorithms like LZMA and ZLIB are widely used, they often fail to exploit the deep structure inherent in scientific data. We investigate the application to this problem of modern state space models (SSMs), which have shown promise for capturing long-range dependencies in sequences. We present the Bytewise Online Autoregressive (BOA) Constrictor, a novel, streaming-capable lossless compressor built upon the Mamba architecture. BOA combines an autoregressive Mamba model for next-byte prediction with a parallelised streaming range coder. We evaluate our method on three distinct structured HEP datasets, demonstrating state-of-the-art compression ratios that improve upon LZMA-9 on all of them. BOA's compression ratios range from 2.21$\times$ (vs. 1.69$\times$ for LZMA-9) on the ATLAS dataset to a substantial 44.14$\times$ (vs. 27.14$\times$) on the highly structured CMS dataset, with a modest model size of $\sim 4.5$ MB. However, this gain in compression ratio comes with a trade-off in throughput; the Storage-Saving Rate ($\sigma_{SSR}$) of our prototype currently lags behind highly optimised CPU-based algorithms like ZLIB. We conclude that while this Mamba-based approach is a highly promising proof-of-principle, significant future work on performance optimisation and hardware portability is required to develop it into a production-ready tool for the HEP community.
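To make the link between next-byte prediction and compressed size concrete, the following is a minimal sketch and not the BOA implementation. It assumes a trivial adaptive order-0 byte-frequency model as a stand-in for the Mamba predictor, and it reports the ideal code length $-\log_2 p$ that a range coder would approach rather than running an actual range coder. The payload, the function names, and the mapping of "LZMA-9" to Python's `preset=9` are all illustrative assumptions.

```python
import lzma
import math
import struct


def ideal_code_length_bits(data: bytes) -> float:
    """Bits an ideal entropy coder would need under an adaptive order-0 model.

    The model is a stand-in (an assumption, not the paper's Mamba model):
    Laplace-smoothed byte counts, updated after each byte, so that a decoder
    seeing the same history could reproduce the same predictions.
    """
    counts = [1] * 256            # Laplace prior: every byte value starts at 1
    total = 256
    bits = 0.0
    for b in data:
        p = counts[b] / total     # predicted probability of the byte that occurred
        bits += -math.log2(p)     # ideal cost of coding that byte
        counts[b] += 1            # online update, mirrored on the decoder side
        total += 1
    return bits


def compression_ratio(original_bytes: float, compressed_bytes: float) -> float:
    """Original size divided by compressed size (higher is better)."""
    return original_bytes / compressed_bytes


# Hypothetical structured payload: 2000 little-endian float32 values from a small set.
values = [0.5 * (i % 7) for i in range(2000)]
payload = struct.pack("<2000f", *values)

model_bytes = ideal_code_length_bits(payload) / 8
lzma_bytes = len(lzma.compress(payload, preset=9))

print(f"order-0 model ratio: {compression_ratio(len(payload), model_bytes):.2f}")
print(f"LZMA preset 9 ratio: {compression_ratio(len(payload), lzma_bytes):.2f}")
```

The better the predictor assigns probability to the byte that actually occurs, the fewer bits the coder spends, which is why a stronger sequence model can translate into a higher compression ratio. The per-byte model evaluation is also where the throughput cost arises.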

Paper Structure

This paper contains 25 sections, 17 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Conceptual diagram of our approach to the compression technique. Multiple boxes signify parallel streams except within the Model where they represent multiple blocks.
  • Figure 2: Streaming pipeline: buffer chunks in parallel (blue), compress in parallel (green), and pre-load the next compressor set (red).
  • Figure 3: Normalised confusion matrices of the 20 most common bytes: (a) CMS, (b) ATLAS, (c) HEPMC, (d) Bundled CMS.
  • Figure 4: Top-$k$ prediction accuracy for each dataset: (a) CMS, (b) ATLAS, (c) HEPMC, (d) Bundled CMS. Here, Top-$k$ is the fraction of bytes whose true value lies within the $k$ most probable model predictions; higher Top-1/Top-$k$ imply lower cross-entropy and thus better compression.
  • Figure 5: Reliability diagrams with residuals for each dataset: (a) CMS, (b) ATLAS, (c) HEPMC, (d) Bundled CMS. $\Delta$ denotes the gap between empirical accuracy and the perfect calibration line.
  • ...and 1 more figure
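
The diagnostics named in the captions of Figures 4 and 5 can be sketched as follows. This is a minimal illustration on a hypothetical 4-symbol alphabet, not the authors' evaluation code: the function names, the toy distributions, the number of bins, and the sign convention for the gap (accuracy minus confidence) are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def top_k_accuracy(probs: Sequence[Sequence[float]], targets: Sequence[int], k: int) -> float:
    """Fraction of positions whose true symbol is among the k most probable predictions."""
    hits = 0
    for dist, target in zip(probs, targets):
        ranked = sorted(range(len(dist)), key=lambda s: dist[s], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets)


def cross_entropy_bits(probs: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean bits per symbol an ideal entropy coder would spend on the true symbols."""
    return sum(-math.log2(dist[t]) for dist, t in zip(probs, targets)) / len(targets)


def calibration_gaps(
    probs: Sequence[Sequence[float]], targets: Sequence[int], n_bins: int = 4
) -> List[Tuple[float, float, float]]:
    """Per non-empty bin: (mean confidence, empirical accuracy, accuracy - confidence)."""
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for dist, target in zip(probs, targets):
        pred = max(range(len(dist)), key=lambda s: dist[s])   # top-1 prediction
        conf = dist[pred]
        idx = min(int(conf * n_bins), n_bins - 1)             # equal-width confidence bins
        bins[idx].append((conf, 1.0 if pred == target else 0.0))
    rows = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            acc = sum(h for _, h in members) / len(members)
            rows.append((mean_conf, acc, acc - mean_conf))
    return rows


# Hypothetical model outputs over a 4-symbol alphabet and the symbols that occurred.
probs = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.20, 0.10],
    [0.25, 0.25, 0.40, 0.10],
    [0.50, 0.30, 0.10, 0.10],
]
targets = [0, 2, 2, 1]

print("top-1:", top_k_accuracy(probs, targets, 1))
print("top-2:", top_k_accuracy(probs, targets, 2))
print("bits/symbol:", round(cross_entropy_bits(probs, targets), 3))
for mean_conf, acc, gap in calibration_gaps(probs, targets):
    print(f"conf={mean_conf:.2f} acc={acc:.2f} gap={gap:+.2f}")
```

For a byte-level model the same functions would apply with 256-entry distributions; the cross-entropy in bits per byte is the quantity that bounds the achievable compressed size.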