Physics-Aware, Shannon-Optimal Compression via Arithmetic Coding for Distributional Fidelity

Cristiano Fanelli

TL;DR

This work uses arithmetic coding to compress datasets losslessly and invertibly under a physics-informed probabilistic representation. The compression is treated not as an end in itself but as a measurement instrument: the difference in achievable codelength, in bits, tests the fidelity between datasets.

Abstract

Assessing whether two datasets are distributionally consistent has become a central theme in modern scientific analysis. Generative artificial intelligence is increasingly used to produce synthetic datasets whose fidelity must be rigorously validated against the original data on which the generative models were trained, and the task is made more challenging by the continued growth in data volume and problem dimensionality. In this work, we propose the use of arithmetic coding to provide a lossless and invertible compression of datasets under a physics-informed probabilistic representation. Datasets that share the same underlying physical correlations admit comparable optimal descriptions, while discrepancies in those correlations, arising from miscalibration, mismodeling, or bias, manifest as an irreducible excess in codelength. This excess codelength defines an operational fidelity metric, quantified directly in bits through differences in achievable compression length relative to a physics-inspired reference distribution. We demonstrate that this metric is global, interpretable, additive across components, and asymptotically optimal in the Shannon sense. Moreover, we show that differences in codelength correspond to differences in expected negative log-likelihood evaluated under the same physics-informed reference model. As a byproduct, we also demonstrate that our compression approach achieves a higher compression ratio than traditional general-purpose algorithms such as gzip. Our results establish lossless, physics-aware compression based on arithmetic coding not as an end in itself, but as a measurement instrument for testing the fidelity between datasets.
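
The link between codelength and negative log-likelihood stated in the abstract can be illustrated with a minimal sketch. It does not reproduce the paper's physics-informed model or its arithmetic coder: the reference model `q` here is a smoothed histogram, the "ADC-like" symbols come from a hypothetical toy generator (`toy_adc`), and the codelength is the ideal value $-\log_2 q(x)$ that an arithmetic coder approaches up to a small constant overhead per encoded stream. All function names and parameters are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def fit_reference_model(samples, alphabet_size, alpha=1.0):
    """Histogram model q(x) over symbols 0..alphabet_size-1 with additive smoothing."""
    counts = Counter(samples)
    total = len(samples) + alpha * alphabet_size
    return [(counts.get(s, 0) + alpha) / total for s in range(alphabet_size)]

def mean_codelength_bits(samples, q):
    """Average ideal codelength -log2 q(x), i.e. the empirical NLL in bits per symbol."""
    return sum(-math.log2(q[x]) for x in samples) / len(samples)

def toy_adc(rng, n, scale=1.0, alphabet_size=64):
    """Hypothetical ADC-like symbols: a scaled exponential, clipped to the alphabet."""
    return [min(alphabet_size - 1, int(scale * rng.expovariate(0.2))) for _ in range(n)]

rng = random.Random(0)

# Fit the reference model on one sample, then score two other samples under it.
q = fit_reference_model(toy_adc(rng, 20000), alphabet_size=64)
held_out = toy_adc(rng, 20000)             # same distribution as the reference
shifted = toy_adc(rng, 20000, scale=1.3)   # mis-scaled stand-in for a "synthetic" sample

L_ref = mean_codelength_bits(held_out, q)
L_shift = mean_codelength_bits(shifted, q)
delta_L = L_shift - L_ref                  # excess codelength in bits per symbol

print(f"reference: {L_ref:.3f} bits/symbol")
print(f"shifted:   {L_shift:.3f} bits/symbol")
print(f"excess codelength: {delta_L:.3f} bits/symbol")
```

Because both samples are scored under the same `q`, the difference `delta_L` is a difference of average negative log-likelihoods expressed in bits, which is the quantity the abstract identifies with the excess codelength.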

Paper Structure

This paper contains 22 sections, 20 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Comparison between original and decoded detector-level observables. (Top row) ADC distributions by detector layer for the original data (left) and the decoded data after compression (right). (Bottom row) Hit multiplicity as a function of particle momentum $p$ for the original (left) and decoded data (right). Arithmetic coding is lossless and invertible, and the exact agreement across all panels demonstrates that the compression–decompression cycle preserves both low-level and derived detector features. This result holds for both codecs (unconditional and conditioned on the particle kinematics).
  • Figure 2: Bit-budget decomposition for unconditional arithmetic coding. The achieved codelength per event is decomposed by calorimeter layer and stereo view (U/V/W), and further split into occupancy, strip, and ADC contributions. The sum over all layer-view combinations yields the total detector-level contribution to the achieved codelength.
  • Figure 3: Bit-budget decomposition for conditional arithmetic coding. The achieved codelength per event is decomposed by calorimeter layer and stereo view (U/V/W), and further split into occupancy, strip, and ADC contributions. Conditioning the hit model on particle kinematics modifies the distribution of bits across layers while preserving the same additive decomposition.
  • Figure 4: Sensitivity of arithmetic coding (AC) and MMD to ADC scale perturbations $\varepsilon$. Mean excess codelength $\Delta L$ (left axis) for unconditional and conditional arithmetic coding is compared to the corresponding $\Delta \mathrm{MMD}^2$ (right axis). AC shows a smooth, monotonic response to increasing perturbations, while MMD remains relatively insensitive at small $\varepsilon$ and increases sharply only at larger deviations.
  • Figure 5: Fidelity tests under controlled ADC scale perturbations $\varepsilon$. One-sided t-test $p$-values are shown for unconditional and conditional arithmetic coding (AC) and for the MMD-based test as a function of the ADC perturbation $\varepsilon$. The horizontal dashed line indicates the $p=0.05$ significance threshold. Conditional AC detects statistically significant deviations at substantially smaller $\varepsilon$ than MMD. The right axis reports the empirical fraction of ADC values that change under the perturbation, illustrating the relationship between physical modification rate and statistical sensitivity.
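
The perturbation study described in the captions of Figures 4 and 5 can be sketched on toy data. The sketch below is an illustration under stated assumptions, not the paper's procedure: events are hypothetical lists of integer ADC codes, the perturbation is assumed to scale each value by $(1+\varepsilon)$ and round back to an integer, the reference model is a smoothed histogram, and the one-sided test uses a Welch-type statistic with a normal approximation as a stand-in for the $t$-test named in the caption.

```python
import math
import random
from statistics import NormalDist, mean, variance

ALPHABET = 64  # hypothetical number of ADC codes

def toy_event(rng, n_hits=20):
    """Hypothetical event: n_hits integer ADC codes drawn from a clipped exponential."""
    return [min(ALPHABET - 1, int(rng.expovariate(0.2))) for _ in range(n_hits)]

def perturb(event, eps):
    """Assumed perturbation: scale each ADC value by (1 + eps), round to an integer code."""
    return [min(ALPHABET - 1, round(a * (1.0 + eps))) for a in event]

def fit_model(events, alpha=1.0):
    """Smoothed histogram q(a) over ADC codes, standing in for the reference model."""
    counts = [alpha] * ALPHABET
    for ev in events:
        for a in ev:
            counts[a] += 1
    total = sum(counts)
    return [c / total for c in counts]

def event_codelength(event, q):
    """Ideal codelength of one event in bits: sum of -log2 q(a) over its hits."""
    return sum(-math.log2(q[a]) for a in event)

def one_sided_p(test, ref):
    """P-value for 'test codelengths are longer', Welch statistic + normal approximation."""
    se = math.sqrt(variance(test) / len(test) + variance(ref) / len(ref))
    z = (mean(test) - mean(ref)) / se
    return 1.0 - NormalDist().cdf(z)

rng = random.Random(1)
q = fit_model([toy_event(rng) for _ in range(2000)])
ref_lengths = [event_codelength(toy_event(rng), q) for _ in range(2000)]

results = {}
for eps in (0.0, 0.02, 0.05, 0.10, 0.20):
    fresh = [toy_event(rng) for _ in range(2000)]
    test_events = [perturb(ev, eps) for ev in fresh]
    # Fraction of ADC values that actually change after rounding.
    changed = sum(a != b for ev, pv in zip(fresh, test_events) for a, b in zip(ev, pv))
    frac_changed = changed / sum(len(ev) for ev in fresh)
    lengths = [event_codelength(ev, q) for ev in test_events]
    delta_L = mean(lengths) - mean(ref_lengths)
    results[eps] = (delta_L, one_sided_p(lengths, ref_lengths), frac_changed)
    print(f"eps={eps:.2f}  dL={delta_L:+.3f} bits/event  "
          f"p={results[eps][1]:.3g}  changed={frac_changed:.3f}")
```

In this toy setting, rounding leaves most integer ADC values unchanged at small $\varepsilon$, so the fraction of modified values bounds how much signal any test can see. This is the kind of relationship between modification rate and sensitivity that the right axis of Figure 5 is described as reporting.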