Guaranteed Conditional Diffusion: 3D Block-based Models for Scientific Data Compression

Jaemoon Lee, Xiao Li, Liangji Zhu, Sanjay Ranka, Anand Rangarajan

TL;DR

This work introduces Guaranteed Conditional Diffusion with Tensor Correction (GCDTC) for lossy scientific data compression, combining 3D block-based encoding with a 2D diffusion denoiser conditioned on slice-wise embeddings and a tensor-correction module to guarantee error bounds. The method deterministically reconstructs data after training and applies a post-processing error guarantee via block-wise PCA to ensure distortion stays within user-defined limits. Experiments on climate (E3SM) and CFD (S3D) datasets show GCDTC outperforms a 3D convolutional autoencoder and achieves competitive compression with SZ at practical NRMSE targets, albeit with slower decoding due to iterative diffusion. The work offers a practical, scalable third paradigm for scientific data compression by leveraging conditional diffusion, 3D conditioning, and explicit error guarantees, with future work aimed at faster decoding and richer block/hyper-block conditioning.

Abstract

This paper proposes a new compression paradigm, Guaranteed Conditional Diffusion with Tensor Correction (GCDTC), for lossy scientific data compression. The framework builds on recent conditional diffusion (CD) generative models and consists of three components: a conditional diffusion model, tensor correction, and an error guarantee. Our diffusion model combines 3D conditioning with a 2D denoising U-Net. A 3D block-based compressing module captures spatiotemporal correlations in structured scientific data, and the reverse diffusion process for 2D spatial data is then conditioned on the "slices" of content latent variables produced by that module. After training, the denoising decoder reconstructs the data using zero noise and the content latent variables, so decoding is entirely deterministic. The reconstructed outputs of the CD model are further post-processed by the tensor correction and error guarantee steps to control and bound the maximum error distortion, which is an essential requirement in lossy scientific data compression. Experiments on two datasets generated by climate and chemical combustion simulations show that our framework outperforms standard convolutional autoencoders and yields compression quality competitive with an existing scientific data compression algorithm.
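
The error guarantee step is only summarized here (the TL;DR describes it as block-wise PCA), so the following is a minimal sketch of how such a guarantee could work, not the authors' implementation. It assumes a per-block $\ell_2$ bound `tau`, blocks flattened into rows, and unquantized coefficients; the function name `guarantee_error_bound` is hypothetical. For each block, the residual is projected onto an orthonormal PCA basis and the largest coefficients are kept until the remaining error falls under the bound.

```python
import numpy as np

def guarantee_error_bound(original, recon, tau):
    """Correct `recon` so that every block satisfies ||x - x_hat||_2 <= tau.

    original, recon: arrays of shape (n_blocks, block_len), one flattened block per row.
    Returns the corrected reconstruction and the number of coefficients kept per block.
    Assumption: coefficients are stored exactly (a real codec would quantize them).
    """
    residual = original - recon
    # Full orthonormal basis from the SVD of the residual matrix (uncentered PCA).
    _, _, vt = np.linalg.svd(residual, full_matrices=True)
    coeffs = residual @ vt.T  # residual == coeffs @ vt because vt is orthonormal

    corrected = recon.copy()
    kept = np.zeros(residual.shape[0], dtype=int)
    for b in range(residual.shape[0]):
        # Visit coefficients from largest to smallest magnitude.
        order = np.argsort(-np.abs(coeffs[b]))
        energy = coeffs[b, order] ** 2
        # tail[k] = squared error left after keeping the first k coefficients.
        tail = np.concatenate([np.cumsum(energy[::-1])[::-1], [0.0]])
        k = int(np.argmax(tail <= tau ** 2))  # smallest k that meets the bound
        corrected[b] += coeffs[b, order[:k]] @ vt[order[:k]]
        kept[b] = k
    return corrected, kept

# Usage with synthetic data standing in for original blocks and a lossy reconstruction.
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 16))
x_hat = x + 0.1 * rng.normal(size=x.shape)
fixed, kept = guarantee_error_bound(x, x_hat, tau=0.2)
```

Blocks that already satisfy the bound keep zero coefficients, so the extra storage cost is paid only where the reconstruction is poor.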

Paper Structure

This paper contains 26 sections, 10 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Overview of our conditional diffusion model for compression. We compress 3D blocks to capture spatiotemporal correlations in scientific datasets. The latent variables guide a 2D denoising diffusion process. Our denoising decoder reconstructs each of the 2D slices in 3D blocks based on its corresponding latent data $\boldsymbol{z}_i$. This enables us to keep a relatively simple U-Net architecture while getting effective conditioning via 3D block compression.
  • Figure 2: A visualization of our 3D conditional diffusion model. We obtain a 3D embedding $\boldsymbol{z}^e$ from the tensor block $\boldsymbol{x}$ to capture spatiotemporal correlations in scientific datasets. If the tensor block size is $D\times H\times W$, the first dimension of $\boldsymbol{z}^e$ must equal $D$, because the diffusion is 2D and the $i^\mathrm{th}$ slice $\boldsymbol{z}_i^e$ conditions the denoising decoder. Hence, the decoder learns to predict the 2D noise at diffusion stage $t$ for each 2D slice $\boldsymbol{x}_i$ in $\boldsymbol{x}$. Architecture details are described in the Appendix.
  • Figure 3: Reconstruction quality vs. compression ratio on the E3SM (left) and S3D (right) datasets. GCDTC and GCAE denote Guaranteed Conditional Diffusion with Tensor Correction and Guaranteed Convolutional AutoEncoder, respectively. Note that the NRMSE results (y-axis) are plotted on a log scale. The results show that GCDTC outperforms GCAE while remaining competitive with SZ.
  • Figure 4: Visualization of reconstructions in E3SM and S3D at compression ratio 100.
  • Figure 5: Illustration of our conditional diffusion model architecture. Numbers above or below the units indicate output channels.
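
The shape constraint stated in the Figure 2 caption can be made concrete with a small sketch. The encoder below is a hypothetical stand-in (plain spatial average pooling), not the paper's 3D compressing module, which would also mix information across depth; it is used only to show that the embedding keeps the depth dimension $D$ so that slice $\boldsymbol{z}_i^e$ can be paired with slice $\boldsymbol{x}_i$. The names `encode_block` and `slice_conditions` and the pooling factor are assumptions.

```python
import numpy as np

def encode_block(x, pool=4):
    """Hypothetical stand-in for the 3D compressing module.

    Average-pools H and W by `pool` and leaves the depth dimension D unchanged,
    so the output has shape (D, H // pool, W // pool).
    """
    d, h, w = x.shape
    return x.reshape(d, h // pool, pool, w // pool, pool).mean(axis=(2, 4))

def slice_conditions(x, z_e):
    """Pair each 2D slice x_i with the embedding slice z_i^e that conditions it."""
    if z_e.shape[0] != x.shape[0]:
        raise ValueError("first dimension of z^e must equal D")
    return [(x[i], z_e[i]) for i in range(x.shape[0])]

# A D x H x W = 8 x 16 x 16 block yields eight (slice, condition) pairs.
x = np.arange(8 * 16 * 16, dtype=float).reshape(8, 16, 16)
z_e = encode_block(x)
pairs = slice_conditions(x, z_e)
```

Each pair would feed one call of the 2D denoiser, which is why the U-Net can stay two-dimensional while the conditioning still comes from a 3D block.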