Foundation Model for Lossy Compression of Spatiotemporal Scientific Data
Xiao Li, Jaemoon Lee, Anand Rangarajan, Sanjay Ranka
TL;DR
This work addresses lossy compression of high-dimensional spatiotemporal scientific data with varying underlying physics by introducing a foundation model (FM) that combines a hyperprior-augmented variational autoencoder (VAE) with a super-resolution (SR) decoder. The approach extends the VAE to 3D to capture spatiotemporal correlations and uses a dedicated SR module in the decoder to improve reconstruction quality, all while enforcing error guarantees through a block-based PCA residual bound. The FM generalizes to unseen domains and data shapes, achieving up to $4\times$ higher compression ratios than state-of-the-art methods after domain-specific fine-tuning, and the SR module yields an approximately $30\%$ higher compression ratio than simple upsampling. The framework reduces storage and transmission costs for large-scale simulations while preserving data integrity, with practical implications for HPC workflows and scientific analytics.
Abstract
We present a foundation model (FM) for lossy scientific data compression that combines a variational autoencoder (VAE) with a hyperprior structure and a super-resolution (SR) module. The VAE uses the hyperprior to model dependencies in the latent space, which improves compression efficiency. The SR module refines low-resolution representations into high-resolution outputs, improving reconstruction quality. By alternating between 2D and 3D convolutions, the model captures spatiotemporal correlations in scientific data while keeping computational cost low. Experimental results show that the FM generalizes well to unseen domains and varying data shapes, achieving up to 4 times higher compression ratios than state-of-the-art methods after domain-specific fine-tuning. The SR module improves the compression ratio by 30% compared to simple upsampling techniques. This approach reduces storage and transmission costs for large-scale scientific simulations while preserving data integrity and fidelity.
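The TL;DR mentions error guarantees enforced through a block-based PCA residual bound, but this summary does not spell out the procedure. The following is a minimal NumPy sketch of one common way such a bound can be realized, not the authors' implementation: the residual between the original data and the decoder output is split into blocks, projected onto an orthonormal PCA basis, and for each block only as many of the largest coefficients are kept as are needed to bring the block's L2 error below a threshold. The function name `bound_block_error` and the parameters `block_size` and `tau` are hypothetical, and quantization and entropy coding of the kept coefficients are omitted.

```python
import numpy as np


def bound_block_error(original, recon, block_size=64, tau=0.1):
    """Correct `recon` so each block's L2 error is at most `tau` (illustrative).

    Assumes the array size is divisible by `block_size`. Returns the corrected
    reconstruction and the number of PCA coefficients kept per block.
    """
    x = np.asarray(original, dtype=np.float64)
    x_hat = np.asarray(recon, dtype=np.float64)
    residual = (x - x_hat).reshape(-1, block_size)

    # Orthonormal PCA basis of the residual blocks (columns, strongest first).
    _, eigvecs = np.linalg.eigh(residual.T @ residual)
    basis = eigvecs[:, ::-1]
    coeffs = residual @ basis  # shape: (n_blocks, block_size)

    # Sort each block's coefficients by magnitude, largest first.
    order = np.argsort(-np.abs(coeffs), axis=1)
    energy = np.take_along_axis(coeffs ** 2, order, axis=1)

    # tail[:, k] = squared L2 error left after keeping the k largest coefficients.
    kept_energy = np.concatenate(
        [np.zeros((energy.shape[0], 1)), np.cumsum(energy, axis=1)], axis=1
    )
    tail = energy.sum(axis=1, keepdims=True) - kept_energy
    tail[:, -1] = 0.0  # keeping every coefficient reproduces the residual

    # Smallest k per block whose remaining error satisfies the bound.
    n_keep = np.argmax(tail <= tau ** 2, axis=1)

    # Zero out every coefficient that is not among the k largest of its block.
    rank = np.argsort(order, axis=1)
    kept = np.where(rank < n_keep[:, None], coeffs, 0.0)

    corrected = x_hat.reshape(-1, block_size) + kept @ basis.T
    return corrected.reshape(x.shape), n_keep
```

Because the basis is orthonormal, the squared error remaining in a block equals the sum of its squared dropped coefficients, which is why the bound can be checked on the coefficients alone. In this sketch the stored correction grows only for blocks where the decoder output misses the bound, and a block that is already within `tau` keeps zero coefficients.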
