Sparse $L^1$-Autoencoders for Scientific Data Compression
Matthias Chung, Rick Archibald, Paul Atzberger, Jack Michael Solomon
TL;DR
The paper addresses high-fidelity, low-artifact compression of scientific data by using an overcomplete latent representation that is sparsified with an $L^1$ penalty on a structured latent map $f$, giving the rate-distortion objective $\min_{\theta_e,\theta_d} \mathbb{E}\left[\| d(e(x;\theta_e);\theta_d) - x \|_2^2\right] + \lambda \| f(e(x;\theta_e)) \|_1$, where $e$ and $d$ are the encoder and decoder with parameters $\theta_e$ and $\theta_d$. It demonstrates compression ratios of up to roughly $525\times$ on SAS data while preserving reconstruction quality, and shows that the latent representations retain discriminative information, as evidenced by KNN classification performance on MNIST near the image-space baseline. The framework is complemented by practical encoding strategies, including index-difference arithmetic coding and weight quantization, which yield additional compression gains and enable integration into HPC pipelines. Overall, the method provides a dataset-targeted, sparsity-promoting autoencoder approach that integrates with lossy and lossless compression to support efficient transmission, storage, and analysis of large scientific datasets.
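As a rough illustration of the two ingredients named in this summary, the following minimal Python sketch evaluates an empirical version of the objective and converts a sparse latent vector into index differences. It is a sketch under stated assumptions, not the authors' implementation: the function names (`sparse_l1_objective`, `index_differences`, `from_index_differences`), the toy encoder and decoder, the identity choice for $f$, and the averaging of both terms over a batch are all hypothetical. The arithmetic coder itself is not shown; the gaps would be handed to an entropy coder in a later step.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def sparse_l1_objective(
    batch: Sequence[Vector],
    encode: Callable[[Vector], List[float]],
    decode: Callable[[Vector], List[float]],
    latent_map: Callable[[Vector], List[float]],
    lam: float,
) -> float:
    """Batch average of ||d(e(x)) - x||_2^2 + lam * ||f(e(x))||_1.

    Assumption: both terms are averaged over the batch.
    """
    total = 0.0
    for x in batch:
        z = encode(x)                      # overcomplete latent code
        x_hat = decode(z)                  # reconstruction
        distortion = sum((a - b) ** 2 for a, b in zip(x_hat, x))
        rate = sum(abs(v) for v in latent_map(z))
        total += distortion + lam * rate
    return total / len(batch)


def index_differences(z: Vector, tol: float = 0.0) -> Tuple[List[int], List[float]]:
    """Return gaps between consecutive nonzero positions, plus the nonzero values."""
    gaps: List[int] = []
    values: List[float] = []
    prev = -1
    for i, v in enumerate(z):
        if abs(v) > tol:
            gaps.append(i - prev)          # small positive integers for an entropy coder
            values.append(v)
            prev = i
    return gaps, values


def from_index_differences(gaps: List[int], values: List[float], length: int) -> List[float]:
    """Invert index_differences, rebuilding the dense latent vector."""
    z = [0.0] * length
    pos = -1
    for gap, v in zip(gaps, values):
        pos += gap
        z[pos] = v
    return z


# Toy, hypothetical maps: a 2-D input embedded in a 4-D (overcomplete) latent space.
toy_encode = lambda x: [x[0], 0.0, x[1], 0.0]
toy_decode = lambda z: [z[0], z[2]]
identity = lambda z: list(z)

loss = sparse_l1_objective([[1.0, -2.0]], toy_encode, toy_decode, identity, lam=0.5)
gaps, values = index_differences([0.0, 0.0, 3.0, 0.0, 0.0, 0.0, -1.5, 2.0])
print(loss)          # 1.5 (zero distortion, L1 norm 3.0 weighted by 0.5)
print(gaps, values)  # [3, 4, 1] [3.0, -1.5, 2.0]
```

Storing gaps rather than absolute indices keeps the integers small when the latent code is sparse, which is the property an entropy coder can exploit.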
Abstract
Scientific datasets present unique challenges for machine-learning-driven compression methods, including more stringent accuracy requirements and the need to mitigate artifacts that could invalidate results. Drawing on results from compressed sensing and rate-distortion theory, we introduce effective data compression methods by developing autoencoders with high-dimensional latent spaces that are $L^1$-regularized to obtain sparse, low-dimensional representations. We show how these information-rich latent spaces can be used to mitigate blurring and other artifacts, yielding highly effective compression methods for scientific data. We demonstrate our methods on small-angle scattering (SAS) datasets, showing that they can achieve compression ratios of around two orders of magnitude and in some cases better. Our compression methods show promise for addressing current bottlenecks in transmission, storage, and analysis in high-performance distributed computing environments. This is central to processing the large volume of SAS data being generated at shared experimental facilities around the world to support scientific investigations. Our approaches provide general ways of obtaining specialized compression methods for targeted scientific datasets.
