Machine Learning Techniques for Data Reduction of Climate Applications
Xiao Li, Qian Gong, Jaemoon Lee, Scott Klasky, Anand Rangarajan, Sanjay Ranka
TL;DR
This work tackles data reduction for spatiotemporal climate data, maximizing compression while preserving binary quantity-of-interest (QoI) masks (the presence or absence of a phenomenon). It introduces a two-stage pipeline: (1) ROI detection with a UNet that generates region masks or probability maps for climate events, including heatmaps for rare tropical cyclone (TC) occurrences; (2) a Guaranteed Autoencoder (GAE) that performs differential, error-bounded compression by processing data in spatiotemporal blocks and enforcing a per-patch bound $\|x-x^G\|_2 \le \tau$, with the residual projected onto a PCA basis $U$ and the resulting coefficients quantized and then Huffman-coded. The method uses 3D convolutions to capture spatiotemporal correlations, partitions the data into ROI, buffer, and non-ROI zones with separate error bounds, and is compared against MGARD as a strong baseline. Empirical results on E3SM climate data show significantly higher compression ratios and lower false-negative rates for TC and atmospheric river (AR) detection than prior region-based methods, while preserving QoI with comparable or better accuracy. The approach enables scalable climate data reduction for large-scale simulations and archives without compromising downstream QoI analyses.
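The per-patch bound $\|x-x^G\|_2 \le \tau$ can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `residual_basis` and `correct_block`, the greedy largest-coefficient-first ordering, the choice to skip blocks that already meet the bound, and the quantization step $\tau/\sqrt{d}$ are all assumptions made for illustration. The basis is taken from an SVD of the uncentered residuals, and the final Huffman coding of the integer codes is omitted.

```python
import numpy as np

def residual_basis(residuals):
    """Orthonormal basis U (d x d) from the SVD of stacked residuals (n x d)."""
    _, _, vt = np.linalg.svd(residuals, full_matrices=True)
    return vt.T

def correct_block(x, x_hat, U, tau):
    """Add quantized basis coefficients to x_hat until ||x - x_g||_2 <= tau.

    x, x_hat: flattened block and its autoencoder reconstruction (length d).
    Returns the corrected block x_g and a dict {basis index: integer code}.
    """
    d = x.size
    # Assumed step size: if every coefficient is kept, the remaining error is
    # at most sqrt(d) * step / 2 = tau / 2, so the loop always meets the bound.
    step = tau / np.sqrt(d)
    x_g = x_hat.copy()
    codes = {}
    if np.linalg.norm(x - x_hat) <= tau:
        return x_g, codes                    # bound already met, store nothing
    c = U.T @ (x - x_hat)                    # residual coefficients in basis U
    for idx in np.argsort(-np.abs(c)):       # largest coefficients first
        q = int(np.round(c[idx] / step))     # uniform scalar quantization
        if q != 0:
            codes[int(idx)] = q
            x_g = x_g + q * step * U[:, idx]
        if np.linalg.norm(x - x_g) <= tau:
            break
    return x_g, codes                        # codes would then be entropy-coded
```

In this sketch only the integer codes and their indices need to be stored per block, so blocks that the autoencoder already reconstructs within $\tau$ cost nothing extra.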
Abstract
Scientists conduct large-scale simulations to compute derived quantities-of-interest (QoI) from primary data. Often, QoI are linked to specific features, regions, or time intervals, so the data can be adaptively reduced without compromising the integrity of the QoI. For many spatiotemporal applications, these QoI are binary in nature and represent the presence or absence of a physical phenomenon. We present a pipelined compression approach that first uses neural-network-based techniques to derive regions where QoI are highly likely to be present. Then, we employ a Guaranteed Autoencoder (GAE) to compress data with differential error bounds. GAE uses the QoI information to apply low-error compression only to these regions. This results in high overall compression ratios while still achieving the downstream goals of the simulation or data collection. Experimental results are presented for climate data generated from the E3SM simulation model for downstream quantities such as tropical cyclone and atmospheric river detection and tracking. These results show that our approach is superior to comparable methods in the literature.
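The idea of differential error bounds can be sketched as a per-cell tolerance map built from a detector's probability map, using the ROI, buffer, and non-ROI partition mentioned in the TL;DR. This is a rough illustration under stated assumptions, not the paper's procedure: the helper names `dilate` and `zone_error_bounds`, the threshold, the square buffer window, and the three tolerance values are hypothetical, and the sketch works on a single 2D field and ignores longitude wrap-around.

```python
import numpy as np

def dilate(mask, width):
    """Binary dilation with a (2*width+1) x (2*width+1) square window."""
    h, w = mask.shape
    padded = np.pad(mask, width, mode="constant", constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    for dy in range(2 * width + 1):
        for dx in range(2 * width + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def zone_error_bounds(prob_map, threshold, width, tau_roi, tau_buffer, tau_bg):
    """Map each grid cell to an error bound based on its zone.

    ROI: cells where the detector probability reaches the threshold.
    Buffer: cells within `width` of the ROI but outside it.
    Non-ROI: everything else.
    """
    roi = prob_map >= threshold
    buffer = dilate(roi, width) & ~roi
    tau = np.full(prob_map.shape, tau_bg, dtype=float)
    tau[buffer] = tau_buffer
    tau[roi] = tau_roi
    return tau

# Hypothetical usage: one high-probability cell in a 7 x 7 field.
prob = np.zeros((7, 7))
prob[3, 3] = 0.9
tau_map = zone_error_bounds(prob, threshold=0.5, width=1,
                            tau_roi=0.01, tau_buffer=0.1, tau_bg=1.0)
```

A compressor could then look up the tolerance for each block from `tau_map`, for example by taking the smallest value inside the block, so that the tight bound is paid for only where a phenomenon is likely to be present.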
