Accurate Performance Modeling And Uncertainty Analysis of Lossy Compression in Scientific Applications
Youyuan Liu, Taolue Yang, Sian Jin
TL;DR
The paper tackles the challenge of predicting compression time for prediction-based lossy compression of large-scale scientific data. It introduces a four-stage decomposition of the compression workflow, with stage-specific surrogate models and offline tuning to enable fast, generalizable predictions. A dedicated uncertainty analysis separates system and algorithm uncertainty, modeling both as normal distributions and combining them to produce a 95% confidence interval; results show an average prediction error of about 5% across six datasets. The work enables time-aware scheduling and resource allocation in HPC workflows, with practical impact on reducing wait times and improving throughput in data-intensive scientific simulations.
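To make the interval construction concrete, here is a minimal Python sketch of one way two normally distributed uncertainty sources could be combined into a 95% confidence interval around a predicted compression time. It is an illustration, not the paper's implementation: it assumes the system and algorithm errors are independent and zero-mean, so their variances add. The function name `combined_interval`, the per-stage times, and both standard deviations are hypothetical values chosen for the example.

```python
import math

# Two-sided 95% quantile of the standard normal distribution.
Z_95 = 1.959963984540054


def combined_interval(predicted_time, sigma_system, sigma_algorithm):
    """Return (low, high) bounds of a 95% interval around predicted_time.

    Assumption (not from the paper): the two error sources are independent,
    zero-mean normals, so the combined standard deviation is the root of the
    summed variances.
    """
    sigma_total = math.hypot(sigma_system, sigma_algorithm)
    half_width = Z_95 * sigma_total
    return predicted_time - half_width, predicted_time + half_width


# Hypothetical per-stage predictions (seconds) from four surrogate models.
stage_times = [0.8, 1.5, 0.4, 0.3]
total = sum(stage_times)

# Hypothetical standard deviations (seconds) for each uncertainty source.
low, high = combined_interval(total, sigma_system=0.03, sigma_algorithm=0.04)
print(f"predicted {total:.2f} s, 95% interval [{low:.3f}, {high:.3f}] s")
```

A scheduler that must not underestimate compression time could reserve the upper bound `high` rather than the point prediction `total`.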
Abstract
Scientific applications typically generate large volumes of floating-point data, making lossy compression one of the most effective methods for data reduction, thereby lowering storage requirements and improving performance in large-scale applications. However, variations in compression time can significantly reduce the overall performance gain through inaccurate scheduling, workload imbalance, and related effects. Existing approaches rely on empirical methods to predict compression performance, which often lack interpretability and suffer from limited accuracy and generalizability. In this paper, we propose surrogate models for predicting the compression time of prediction-based lossy compression and provide a detailed uncertainty analysis of the factors that influence time variability. Our evaluation shows that our solution can accurately predict the compression time with 5% average error across six scientific datasets. It also provides an accurate 95% confidence interval, which is essential for time-sensitive scheduling and applications.
