Accurate Performance Modeling And Uncertainty Analysis of Lossy Compression in Scientific Applications

Youyuan Liu, Taolue Yang, Sian Jin

TL;DR

The paper tackles the challenge of predicting compression time for prediction-based lossy compression in large-scale scientific data. It introduces a four-stage decomposition of the compression workflow, with stage-specific surrogate models and offline tuning to enable fast, generalizable predictions. A dedicated uncertainty analysis separates system and algorithm uncertainty, modeling both as normal distributions and combining them to produce a $95\%$ confidence interval; results show an average prediction error of about $5\%$ across six datasets. The work enables time-aware scheduling and resource allocation in HPC workflows, with practical impact on reducing wait times and improving throughput in data-intensive scientific simulations.
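As a rough illustration of the uncertainty step, the sketch below combines two normal uncertainty terms into a single 95% interval around a predicted time. It assumes that the system and algorithm terms are independent and additive, so their variances add; the summary does not spell out the exact combination rule, and the function name `combined_interval` and the numbers in the usage lines are hypothetical.

```python
from statistics import NormalDist


def combined_interval(predicted_time, system, algorithm, confidence=0.95):
    """Return (low, high) bounds around a predicted compression time.

    Assumption: `system` and `algorithm` are independent normal error
    terms (in seconds) that add to the prediction, so their sum is
    also normal with summed means and summed variances.
    """
    total = system + algorithm  # NormalDist supports adding independent normals
    tail = (1.0 - confidence) / 2.0
    low = predicted_time + total.inv_cdf(tail)
    high = predicted_time + total.inv_cdf(1.0 - tail)
    return low, high


# Hypothetical values: a 2.0 s prediction with zero-mean error terms.
system_term = NormalDist(mu=0.0, sigma=0.03)
algorithm_term = NormalDist(mu=0.0, sigma=0.04)
low, high = combined_interval(2.0, system_term, algorithm_term)
print(f"95% interval: [{low:.3f}, {high:.3f}] s")
```

If the two terms were correlated, the variances would no longer simply add and the interval would need a covariance term.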

Abstract

Scientific applications typically generate large volumes of floating-point data, making lossy compression one of the most effective methods for data reduction, thereby lowering storage requirements and improving performance in large-scale applications. However, variations in compression time can significantly reduce the overall performance improvement by causing inaccurate scheduling, workload imbalances, and similar problems. Existing approaches rely on empirical methods to predict compression performance, which often lack interpretability and suffer from limited accuracy and generalizability. In this paper, we propose surrogate models for predicting the compression time of prediction-based lossy compression and provide a detailed uncertainty analysis of the factors influencing time variability. Our evaluation shows that our solution can accurately predict the compression time with a 5% average error across six scientific datasets. It also provides an accurate 95% confidence interval, which is essential for time-sensitive scheduling and applications.

Paper Structure

This paper contains 23 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Compression throughput varies significantly across bitrates and datasets. The throughput is shown stacked by stage, based on each stage's execution time.
  • Figure 2: Aligning codes to 1-byte boundaries. Case 1: the current code is short enough to fit in the current byte, so no new byte is written. Case 2: the current code exceeds the remaining length of the current byte, so it fills the remaining length and then starts a new byte. Case 3: the current byte is empty.
  • Figure 3: Overview of the prediction error for various fields in different datasets
  • Figure 4: Comparison between real and predicted time across bitrate.
  • Figure 5: Distributions of the algorithm and system uncertainty, with their corresponding gamma and normal distribution fits, for the CESM and SCALE datasets.