Scientific Data Compression and Super-Resolution Sampling
Minh Vu, Andrey Lokhov
TL;DR
The paper tackles the challenge of storing and processing massive scientific datasets while preserving physically meaningful quantities of interest (QoIs) and enabling reconstruction through super-resolution. It proposes a physics-informed framework built on exponential-family graphical models, where QoIs drive the sufficient statistics and the energy $E(\vec{x})=\boldsymbol{\theta}\cdot Q(\vec{x})$ defines the distribution $P(\vec{x})=\frac{1}{Z}\exp(E(\vec{x}))$. QoI-preserving compression combines a lossy DCT-based encoding (with compression level $C=1-E_{presv}$) and a post-decompression MCMC correction, initialized from the decompressed data to efficiently sample from the learned model. The approach employs discrete learning via GRISE with the ISO objective for Ising models and continuous learning via Score Matching, followed by Glauber or Langevin dynamics for sampling, and a warm-start strategy for super-resolution sampling. Across discrete and continuous benchmarks, including D-Wave data, $\Phi^4$ theory, and aluminum MD, the method achieves strong QoI recovery with relatively few correction steps and exhibits favorable polynomial scaling, illustrating practical applicability for checkpointing, inverse problems, and data-driven scientific inference.
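The compress-then-correct pipeline described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: $E_{presv}$ is read as the fraction of DCT spectral energy retained, discrete data are recovered by sign-thresholding the decompressed signal, and the Ising parameters `J` and `h` are taken as already learned (for example by GRISE). The function names and the toy 1D chain are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D, so that D @ D.T is the identity."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d

def dct_compress(x, e_presv):
    """Keep the largest DCT coefficients that hold a fraction e_presv
    of the spectral energy (assumed meaning of E_presv); zero the rest."""
    coeffs = dct_matrix(x.size) @ x
    order = np.argsort(coeffs**2)[::-1]          # largest energy first
    cum = np.cumsum(coeffs[order] ** 2)
    n_keep = int(np.searchsorted(cum, e_presv * cum[-1])) + 1
    n_keep = min(n_keep, x.size)
    kept = np.zeros_like(coeffs)
    kept[order[:n_keep]] = coeffs[order[:n_keep]]
    return kept, n_keep

def dct_decompress(kept):
    """Invert the orthonormal DCT-II (the inverse is the transpose)."""
    return dct_matrix(kept.size).T @ kept

def glauber_correct(x0, J, h, n_sweeps, rng):
    """Heat-bath Glauber dynamics for P(x) proportional to
    exp(0.5 * x @ J @ x + h @ x), warm-started from x0.
    J is assumed symmetric with a zero diagonal."""
    x = x0.copy()
    n = x.size
    for _ in range(n_sweeps * n):
        i = rng.integers(n)
        field = h[i] + J[i] @ x                   # local field at site i
        p_up = 1.0 / (1.0 + np.exp(-2.0 * field))
        x[i] = 1.0 if rng.random() < p_up else -1.0
    return x

# Toy usage: a ferromagnetic chain of 16 spins (hypothetical parameters).
rng = np.random.default_rng(0)
n = 16
J = np.zeros((n, n))
idx = np.arange(n - 1)
J[idx, idx + 1] = J[idx + 1, idx] = 0.5
h = np.zeros(n)

x_true = glauber_correct(rng.choice([-1.0, 1.0], size=n), J, h, 50, rng)
kept, n_keep = dct_compress(x_true, e_presv=0.8)  # C = 1 - 0.8 under the assumption above
x_dec = np.where(dct_decompress(kept) >= 0, 1.0, -1.0)
x_corr = glauber_correct(x_dec, J, h, 2, rng)     # short warm-started correction
```

Starting the chain from `x_dec` rather than from a random configuration is what keeps the correction short in this sketch: the decompressed sample is presumably already close to typical configurations of the model, so a few sweeps may be enough to restore the statistics the lossy step distorted.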
Abstract
Modern scientific simulations, observations, and large-scale experiments generate data at volumes that often exceed the limits of storage, processing, and analysis. This challenge drives the development of data reduction methods that efficiently manage massive datasets while preserving essential physical features and quantities of interest. In many scientific workflows, it is also crucial to enable data recovery from compressed representations, a task known as super-resolution, with guarantees on the preservation of key physical characteristics. A notable example is checkpointing and restarting, which is essential for long-running simulations to recover from failures, resume after interruptions, or examine intermediate results. In this work, we introduce a novel framework for scientific data compression and super-resolution, grounded in recent advances in learning exponential families. Our method preserves and quantifies uncertainty in physical quantities of interest and supports flexible trade-offs between compression ratio and reconstruction fidelity.
