Sparse $L^1$-Autoencoders for Scientific Data Compression

Matthias Chung, Rick Archibald, Paul Atzberger, Jack Michael Solomon

TL;DR

The paper addresses compression of scientific data with high fidelity and minimal artifacts. It uses an overcomplete latent representation that is sparsified with an $L^1$ penalty on a structured latent map $f$, giving the rate-distortion objective $\min_{\theta_e,\theta_d} \mathbb{E}\left[\left\|d(e(x;\theta_e);\theta_d) - x\right\|_2^2\right] + \lambda \left\|f(e(x;\theta_e))\right\|_1$. It reports compression ratios on SAS data of up to roughly $525\times$ while preserving reconstruction quality, and shows that the latent representations retain discriminative information, as evidenced by MNIST KNN performance near the image-space baseline. The framework is complemented by practical encoding strategies, including index-difference arithmetic coding and weight quantization, which yield additional compression gains and enable integration into HPC pipelines. Overall, the method provides a dataset-targeted, sparsity-promoting autoencoder approach that integrates with lossy and lossless compression to support efficient transmission, storage, and analysis of large scientific datasets.
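The objective above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: a single linear-plus-ReLU encoder and a linear decoder stand in for the networks $e$ and $d$, the expectation is replaced by a batch average applied to both terms, and the names `objective`, `identity`, and `forward_difference` are hypothetical. The two choices of `f` mirror the $z$ and $\nabla z$ variants mentioned in the figure captions, with a one-dimensional forward difference standing in for $\nabla$.

```python
# Illustrative sketch only: toy linear maps stand in for the paper's networks.

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * t for w, t in zip(row, v)) for row in W]

def relu(v):
    return [max(0.0, t) for t in v]

def identity(z):
    # f(z) = z: penalize the latent code directly.
    return list(z)

def forward_difference(z):
    # Assumed 1D stand-in for f(z) = grad z: differences of neighboring entries.
    return [b - a for a, b in zip(z, z[1:])]

def objective(batch, W_e, W_d, lam, f=identity):
    """Batch average of ||d(e(x)) - x||_2^2 + lam * ||f(e(x))||_1."""
    total = 0.0
    for x in batch:
        z = relu(matvec(W_e, x))   # latent code e(x; theta_e)
        x_hat = matvec(W_d, z)     # reconstruction d(z; theta_d)
        distortion = sum((a - b) ** 2 for a, b in zip(x_hat, x))
        rate = sum(abs(t) for t in f(z))
        total += distortion + lam * rate
    return total / len(batch)

# Overcomplete toy example: input dimension 2, latent dimension 3.
W_e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
batch = [[1.0, 2.0]]
loss_z = objective(batch, W_e, W_d, lam=0.5)
loss_grad_z = objective(batch, W_e, W_d, lam=0.5, f=forward_difference)
```

In this toy setup the reconstruction is exact, so the whole loss comes from the sparsity term; increasing `lam` trades reconstruction error for fewer nonzero latent entries once the weights are trained.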

Abstract

Scientific datasets present unique challenges for machine learning-driven compression methods, including more stringent requirements on accuracy and mitigation of potential invalidating artifacts. Drawing on results from compressed sensing and rate-distortion theory, we introduce effective data compression methods by developing autoencoders using high-dimensional latent spaces that are $L^1$-regularized to obtain sparse low-dimensional representations. We show how these information-rich latent spaces can be used to mitigate blurring and other artifacts to obtain highly effective data compression methods for scientific data. We demonstrate our methods for small-angle scattering (SAS) datasets, showing they can achieve compression ratios around two orders of magnitude and in some cases better. Our compression methods show promise for use in addressing current bottlenecks in transmission, storage, and analysis in high-performance distributed computing environments. This is central to processing the large volume of SAS data being generated at shared experimental facilities around the world to support scientific investigations. Our approaches provide general ways for obtaining specialized compression methods for targeted scientific datasets.

Paper Structure

This paper contains 5 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overcomplete autoencoder architecture, where the latent space dimension $\ell$ is bigger than the input dimension $n$. Sparsity on the latent variable is imposed via $L^1$-type regularization.
  • Figure 2: Autoencoder architecture for numerical examples. The network is a fully connected, five-layer symmetric neural network with hidden layer sizes $(m,\ell,m)$, a $\mathrm{ReLU}$ activation function between each layer, and input/output size $n$.
  • Figure 3: We show a representative testing input $x_j$ for the SAS application in the first column, while its reconstructions using sparse autoencoder networks $(1,2,10,z)$ on the top and $(1,2,10,\nabla z)$ on the bottom are presented in the second column. The reconstruction errors $\left\|d(e(x_j; \theta_e); \theta_d) - x_j\right\|_{2}$ are $3.43\times 10^{-3}$ and $8.15\times 10^{-4}$, respectively. We show the latent space variable $z_j = e(x_j; \theta_e)$ in the third column. The latent variables $z_j$ contain 34 and 26 nonzero elements, respectively. With an original image size of $64\times 64$, the resulting compression ratios $\left\|x_j\right\|_{0}$ to $\left\|e(x_j; \theta_e)\right\|_{0}$ and $\left\|x_j\right\|_{0}$ to $\left\|\nabla e(x_j; \theta_e)\right\|_{0}$ are $120:1$ and $157:1$.
  • Figure 4: We show the error and compression rate probability density functions (PDFs) for the sparse autoencoder networks $(1,2,10,z)$ on the top and $(1,2,10,\nabla z)$ on the bottom. The top left plot displays the distributions over the entire training dataset, which consists of $50,\!000$ random SAS images generated by SASView. The bottom left plot displays the distributions for predictions on $150,\!000$ independent random SAS images generated by SASView after training. The mean training errors of the two approaches are $2.48\times 10^{-3}$ and $8.00 \times 10^{-4}$, respectively. Correspondingly, the mean training compression rates are $153\times$ and $216\times$. These values change only insignificantly on the testing set.
  • Figure 5: For further compression of $z$ with our arithmetic entropy encoding, we show the distribution of index differences $\rho(\delta_k)$ for our representation $z \rightarrow (\delta,w)$ (left). For the SAS data, we show the additional size reduction, in percent, obtained by encoding the index differences $\delta$ (right).
  • ...and 1 more figure
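
The representation $z \rightarrow (\delta, w)$ from Figure 5 and the nonzero-count compression ratio from Figure 3 can be illustrated with a short Python sketch. The exact definition of $\delta$ used in the paper is not given in the captions, so the convention here (gap to the previous nonzero index, with the first gap counted from position $-1$) is an assumption, as are the function names. The arithmetic coding stage and weight quantization are not implemented; the sketch only produces the index differences and weights that such a coder would consume.

```python
# Illustrative sketch only: the delta convention and all names are assumptions.

def to_index_differences(z):
    """Map a sparse vector z to (delta, w): gaps between nonzero indices and their values."""
    delta, w = [], []
    prev = -1
    for i, value in enumerate(z):
        if value != 0:
            delta.append(i - prev)  # gap to the previous nonzero entry
            w.append(value)
            prev = i
    return delta, w

def from_index_differences(delta, w, length):
    """Invert to_index_differences, rebuilding the dense vector of a given length."""
    z = [0.0] * length
    pos = -1
    for gap, value in zip(delta, w):
        pos += gap
        z[pos] = value
    return z

def l0_compression_ratio(n_input, z):
    """Ratio of input size to the number of nonzero latent entries."""
    return n_input / sum(1 for v in z if v != 0)

z = [0.0, 0.0, 1.5, 0.0, -2.0, 0.0, 0.0, 0.0, 0.5]
delta, w = to_index_differences(z)
z_restored = from_index_differences(delta, w, len(z))

# A 64 x 64 input with 34 nonzero latent entries, as in the Figure 3 caption.
ratio = l0_compression_ratio(64 * 64, [1.0] * 34 + [0.0] * 10)
```

Small gaps tend to dominate when nonzero entries cluster, which is what makes the index differences a natural target for entropy coding. As a check on the caption's arithmetic, $4096/34 \approx 120.5$ and $4096/26 \approx 157.5$, consistent with the reported $120:1$ and $157:1$.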