Machine Learning Techniques for Data Reduction of CFD Applications

Jaemoon Lee; Ki Sung Jung; Qian Gong; Xiao Li; Scott Klasky; Jacqueline Chen; Anand Rangarajan; Sanjay Ranka

TL;DR

The paper addresses the volume of data produced by exascale CFD simulations with an error-bounded data-reduction framework. It introduces a guaranteed block autoencoder (GBATC) that operates on multidimensional tensor blocks: a 3D convolutional autoencoder captures spatiotemporal and interspecies correlations, a tensor correction network refines the reconstruction, and a PCA-based projection of the residual guarantees that the reconstruction error stays within a user-defined bound, $\|x - x^G\|_2 \le \tau$. Quantization and entropy coding are then applied for compression. The approach achieves data reduction of two to three orders of magnitude while preserving quality for both primary data (PD) and downstream quantities of interest (QoIs), and it outperforms the SZ baseline on the S3D DNS dataset. These results indicate the method's potential for scalable, QoI-preserving data management in CFD and multiphysics simulations. The work also discusses related literature on error-bounded compressors and points toward future enhancements, including extending guarantees to broader QoIs and end-to-end training.

Abstract

We present an approach called guaranteed block autoencoder that leverages Tensor Correlations (GBATC) for reducing the spatiotemporal data generated by computational fluid dynamics (CFD) and other scientific applications. It uses a multidimensional block of tensors (spanning space and time) for both input and output, capturing the spatiotemporal and interspecies relationships within a tensor. The tensor consists of species that represent different elements in a CFD simulation. To guarantee the error bound of the reconstructed data, principal component analysis (PCA) is applied to the residual between the original and reconstructed data. This yields a basis matrix, which is then used to project the residual of each instance. The resulting coefficients are retained to enable accurate reconstruction. Experimental results demonstrate that our approach can deliver two orders of magnitude in reduction while still keeping the errors of primary data under scientifically acceptable bounds. Compared to reduction approaches based on SZ, our method achieves a substantially higher compression ratio for a given error bound or a lower error for a given compression ratio.
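The PCA-based residual step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`fit_residual_basis`, `correct_instance`), the absence of mean-centering, and the greedy largest-magnitude coefficient selection are all assumptions made for illustration, and the quantization and entropy coding of the retained coefficients are omitted.

```python
import numpy as np

def fit_residual_basis(residuals):
    """Return an orthonormal basis (as columns) from a PCA/SVD of the residuals.

    residuals: array of shape (n_instances, d), original minus reconstruction.
    Assumption: no mean-centering, so the full basis spans every residual.
    """
    _, _, vt = np.linalg.svd(residuals, full_matrices=True)
    return vt.T  # shape (d, d), columns ordered by decreasing singular value

def correct_instance(x, x_rec, basis, tau):
    """Add projected residual components until ||x - x_g||_2 <= tau.

    Assumption: coefficients are added greedily by decreasing magnitude.
    Returns the corrected vector plus the indices and values to be stored.
    """
    coeffs = basis.T @ (x - x_rec)          # project the residual on the basis
    order = np.argsort(-np.abs(coeffs))     # largest coefficients first
    x_g = x_rec.copy()
    kept = []
    for idx in order:
        if np.linalg.norm(x - x_g) <= tau:  # bound already satisfied
            break
        x_g = x_g + coeffs[idx] * basis[:, idx]
        kept.append(idx)
    kept = np.array(kept, dtype=int)
    return x_g, kept, coeffs[kept]

# Synthetic stand-in data: x_rec plays the role of the autoencoder output.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 16))
x_rec = x + 0.1 * rng.normal(size=x.shape)
basis = fit_residual_basis(x - x_rec)
tau = 0.05
x_g, kept, vals = correct_instance(x[0], x_rec[0], basis, tau)
```

Because the basis is orthonormal and complete, adding every coefficient recovers the original vector up to floating-point error, so the loop always terminates with the bound satisfied. The compression benefit comes from instances that need only a few coefficients.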

Paper Structure (9 sections, 3 equations, 8 figures, 1 algorithm)

This paper contains 9 sections, 3 equations, 8 figures, 1 algorithm.

Introduction
Methodology
Guaranteed Autoencoder (GAE)
Guaranteed Block Autoencoder (GBA)
GBA with Tensor Correction Network (GBATC)
SZ
Experimental Results
Related Work
Conclusions

Figures (8)

Figure 1: The structure of the autoencoder: Conv3D denotes the 3D convolution layer, Conv3DTranspose denotes the 3D transposed convolution layer, FC denotes the fully connected layer, and $h$ denotes the latent space. Leaky ReLU is adopted as the activation function. Each channel in the 3D convolution layers processes one of the $S$ species in the CFD application.
Figure 2: Indices encoding
Figure 3: Guaranteed Block Autoencoder with Tensor Correction Network (GBATC). The AE processes 3D blocks through convolutional layers with $S$ channels and further compresses the block with a fully connected layer as described in Figure 1. After getting the reconstructed data, we convert the block into a set of vectors. The vectors represent $S$ species data for the specific temporal and spatial points and they are corrected by the tensor correction network. The network learns a mapping from the reconstructed data back to the original data, and it is overcomplete as compression is performed by the AE.
Figure 4: Comparison of a block-based GAE with SZ when the QoIs are $\mathcal{O}(N)$ with only PD guarantees: (a) PD error versus compression ratio, (b) QoI error versus compression ratio. Our approach has high compression ratios because it utilizes the entire tensor along with spatiotemporal relationships.
Figure 5: Temporal evolution ($t$ = 1.5, 1.8, and 2.0 ms) of the (left half) mass fraction and (right half) formation rate of H$_2$O as predicted by (first row) DNS, (second row) GBATC, (third row) GBA, and (last row) SZ. The compression ratios for all the results are 400.
...and 3 more figures
