Scientific Data Compression and Super-Resolution Sampling

Minh Vu, Andrey Lokhov

TL;DR

The paper tackles the challenge of storing and processing massive scientific datasets while preserving physically meaningful quantities of interest (QoIs) and enabling reconstruction through super-resolution. It proposes a physics-informed framework built on exponential-family graphical models, where QoIs drive the sufficient statistics and the energy $E(\vec{x})=\boldsymbol{\theta}\cdot Q(\vec{x})$ defines the distribution $P(\vec{x})=\frac{1}{Z}\exp(E(\vec{x}))$. QoI-preserving compression combines a lossy DCT-based encoding (with compression level $C=1-E_{presv}$) and a post-decompression MCMC correction, initialized from the decompressed data to efficiently sample from the learned model. The approach employs discrete learning via GRISE with the ISO objective for Ising models and continuous learning via Score Matching, followed by Glauber or Langevin dynamics for sampling, and a warm-start strategy for super-resolution sampling. Across discrete and continuous benchmarks, including D-Wave data, $\Phi^4$ theory, and aluminum MD, the method achieves strong QoI recovery with relatively few correction steps and exhibits favorable polynomial scaling, illustrating practical applicability for checkpointing, inverse problems, and data-driven scientific inference.
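To make the compression level $C=1-E_{presv}$ concrete, here is a minimal sketch of a lossy DCT encoding. It assumes that $E_{presv}$ is the fraction of DCT spectral energy preserved; the summary does not define it, so that reading is a guess. The function names `dct2`, `idct2`, `compress`, and `decompress` are hypothetical, and the O(N^2) transform is written out only to keep the example dependency-free.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence (O(N^2), for illustration only)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(c):
    """Inverse of dct2 (the transpose of the orthonormal DCT-II matrix)."""
    n = len(c)
    out = []
    for i in range(n):
        s = math.sqrt(1.0 / n) * c[0]
        s += sum(math.sqrt(2.0 / n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def compress(x, level):
    """Keep the largest-magnitude DCT coefficients until a fraction
    (1 - level) of the spectral energy is preserved (assumed meaning of
    C = 1 - E_presv); all other coefficients are dropped."""
    coeffs = dct2(x)
    total = sum(c * c for c in coeffs)
    target = (1.0 - level) * total
    order = sorted(range(len(coeffs)), key=lambda k: -abs(coeffs[k]))
    kept, acc = {}, 0.0
    for k in order:
        if acc >= target:
            break
        kept[k] = coeffs[k]
        acc += coeffs[k] ** 2
    return kept, len(x)

def decompress(kept, n):
    """Rebuild a length-n signal from the stored coefficients (zeros elsewhere)."""
    coeffs = [0.0] * n
    for k, v in kept.items():
        coeffs[k] = v
    return idct2(coeffs)

# Usage: compress a smooth signal at level C = 0.9 and decode it again.
signal = [math.sin(0.3 * i) + 0.1 * i for i in range(16)]
stored, size = compress(signal, 0.9)
decoded = decompress(stored, size)
```

In the framework described above, `decoded` would not be the final output: it serves as the starting point for the MCMC correction under the learned model.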

Abstract

Modern scientific simulations, observations, and large-scale experiments generate data at volumes that often exceed the limits of storage, processing, and analysis. This challenge drives the development of data reduction methods that efficiently manage massive datasets while preserving essential physical features and quantities of interest. In many scientific workflows, it is also crucial to enable data recovery from compressed representations - a task known as super-resolution - with guarantees on the preservation of key physical characteristics. A notable example is checkpointing and restarting, which is essential for long-running simulations to recover from failures, resume after interruptions, or examine intermediate results. In this work, we introduce a novel framework for scientific data compression and super-resolution, grounded in recent advances in learning exponential families. Our method preserves and quantifies uncertainty in physical quantities of interest and supports flexible trade-offs between compression ratio and reconstruction fidelity.

Paper Structure

This paper contains 18 sections, 8 equations, 5 figures.

Figures (5)

  • Figure 1: A schematic representation of our proposed approach. (a) The compression step involves learning a compact representation of the data distribution by learning a model in the exponential family with conserved quantities of interest $Q(\vec{x})$, where the desired QoIs are used as sufficient statistics. We separately store a compressed version of the original data using lossy compression. (b) When needed, this compressed data is decoded to initialize sampling procedures from the learned model, enabling efficient local correction of the data distribution and yielding correct statistics of the QoIs $Q(\vec{x})$.
  • Figure 2: Intuition behind our super-resolution sampling proposal: a stored reduced sample enables initialization of an MCMC method in the vicinity of the original sample, which then samples in a local part of the phase space based on the learned energy function $E(\vec{x})$. In practical situations, this approach may not suffer from the slow mixing of MCMC sampling started from random initial conditions, which highlights the value of the stored reduced data.
  • Figure 3: Maximum element-wise errors of the first and second moments of reconstructed samples computed across different compression levels in synthetic (left) and real (right) datasets with discrete data. Results are averaged over 5 randomly generated models and experiments. The maximum element-wise error means and standard deviations are shown for 4 scenarios: (i) vanilla DCT decompressed samples, (ii) after application of our super-resolution correction (with 10, 12, and 10 MCMC steps for Ising-QoIs, Ising-TV, and D-Wave, respectively), (iii) reconstruction from standard autoencoders, and (iv) reconstruction from regularized autoencoders.
  • Figure 4: Maximum element-wise QoI errors of reconstructed samples computed across different compression levels in synthetic (left) and real (right) datasets with continuous data. Results are averaged over 5 randomly generated models and experiments. We use the same method as in Figure 3. In our super-resolution correction part, we use 7, 20, and 50 correction steps for the Multivariate Normal, $\Phi^4$-theory, and Aluminum design experiments, respectively.
  • Figure 5: Dependence of the wall-clock computation time of our approach on the dimensionality of the data for multivariate Normal distributions (left) and D-Wave experiments (right). The figure shows the runtime (in seconds) of each component of our algorithm (learning, DCT-based compression/decompression, and MCMC correction) as a function of system size $N$ for the maximum considered compression level $C=0.9$. We empirically find that all three components scale no worse than quadratically with system size in this case. All data points are averaged over 3 model instances; error bars are the empirical standard deviation.
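The warm-start idea in Figures 1(b) and 2 can be sketched for the discrete (Ising) case as follows. This is an illustration under stated assumptions, not the authors' implementation: the energy is taken as $E(\vec{x})=\sum_{i<j}J_{ij}x_ix_j+\sum_i h_ix_i$ with $P(\vec{x})\propto\exp(E(\vec{x}))$, the couplings `J` and fields `h` stand in for a model already learned (for example with GRISE), and the names `to_spins` and `glauber_correct` are hypothetical. One "sweep" here means $N$ single-site updates; the summary does not say how the paper counts its MCMC steps.

```python
import math
import random

def to_spins(values):
    """Map decoded real values to +/-1 spins by sign (hypothetical helper)."""
    return [1 if v >= 0 else -1 for v in values]

def glauber_correct(x0, J, h, sweeps, rng):
    """Run Glauber dynamics starting from the decompressed sample x0.

    Each update picks a random site i and resamples x_i from its conditional
    under P(x) ~ exp(E(x)):  P(x_i = +1 | rest) = 1 / (1 + exp(-2 * field_i)),
    with field_i = h_i + sum_j J_ij x_j. Fields are assumed moderate in size,
    so math.exp does not overflow.
    """
    x = list(x0)
    n = len(x)
    for _ in range(sweeps * n):
        i = rng.randrange(n)
        field = h[i] + sum(J[i][j] * x[j] for j in range(n) if j != i)
        p_up = 1.0 / (1.0 + math.exp(-2.0 * field))
        x[i] = 1 if rng.random() < p_up else -1
    return x

# Usage: a toy 4-spin ferromagnetic chain with assumed parameters.
rng = random.Random(0)
n = 4
J = [[0.0] * n for _ in range(n)]
for i in range(n - 1):
    J[i][i + 1] = J[i + 1][i] = 0.5
h = [0.1] * n
decoded = [0.8, -0.1, 0.6, 0.9]   # stand-in for a lossy-decoded sample
corrected = glauber_correct(to_spins(decoded), J, h, sweeps=10, rng=rng)
```

Starting the chain from `to_spins(decoded)` rather than from a random configuration is the point of the sketch: the chain only needs to correct local errors introduced by the lossy encoding.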