Understanding The Effectiveness of Lossy Compression in Machine Learning Training Sets

Robert Underwood, Jon C. Calhoun, Sheng Di, Franck Cappello

TL;DR

This work addresses the data-volume bottleneck in HPC-enabled ML/AI by developing a systematic methodology for evaluating lossy data reduction, applied to 7 ML/AI applications using 17 data-reduction techniques. The study demonstrates that modern error-bounded lossy compression (EBLC) methods can achieve $50$-$100\times$ compression with around $1\%$ quality loss, identifies per-column value-range relative error bounds as particularly effective for tabular data, and introduces an efficient Pareto-point sampling approach to navigate the multi-objective trade-offs. It analyzes the behavior of lossless, traditional lossy, and error-bounded lossy compressors, and gives practitioners and compressor designers actionable guidance, including how to leverage IO-accelerated transfers and parallel decompression. The findings have practical impact on data sharing, reproducibility, and cost reduction in HPC-based ML/AI pipelines, and they outline directions for targeted compressor design improvements and automated evaluation workflows.
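To make the idea of a per-column value-range relative error bound concrete, here is a minimal sketch in plain Python. It assumes the usual reading of the term: a single relative bound is scaled by each column's range (max minus min) to obtain an absolute bound for that column. The helper names `per_column_abs_bounds` and `quantize_table` are hypothetical, the sample rows are made up, and the uniform quantizer is only a toy stand-in for a real error-bounded compressor, not the method evaluated in the paper.

```python
from typing import List, Sequence, Tuple


def per_column_abs_bounds(rows: Sequence[Sequence[float]], rel_bound: float) -> List[float]:
    """Scale one relative bound by each column's value range (max - min)."""
    columns = list(zip(*rows))
    return [rel_bound * (max(col) - min(col)) for col in columns]


def quantize_table(
    rows: Sequence[Sequence[float]], rel_bound: float
) -> Tuple[List[List[float]], List[float]]:
    """Toy error-bounded reduction (assumption, not a real compressor):
    snap each value to a per-column grid with step 2 * eps, so the
    pointwise error stays within eps for that column."""
    eps = per_column_abs_bounds(rows, rel_bound)
    mins = [min(col) for col in zip(*rows)]
    reconstructed = []
    for row in rows:
        new_row = []
        for x, lo, e in zip(row, mins, eps):
            if e == 0.0:
                new_row.append(x)  # constant column: nothing to quantize
            else:
                step = 2.0 * e
                new_row.append(lo + round((x - lo) / step) * step)
        reconstructed.append(new_row)
    return reconstructed, eps


# Made-up table: the two columns differ in scale by several orders of magnitude.
rows = [[0.10, 1200.0], [0.55, 98000.0], [0.90, 560000.0]]
recon, eps = quantize_table(rows, rel_bound=0.01)
```

With a single global absolute bound, the small-range column would either be distorted heavily or force an unnecessarily tight bound on the large-range column; scaling per column avoids both.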

Abstract

Machine Learning and Artificial Intelligence (ML/AI) techniques have become increasingly prevalent in high performance computing (HPC). However, these methods depend on vast volumes of floating-point data for training and validation, and that data must be shared over a wide area network (WAN) or transferred from edge devices to data centers. Data compression can address these problems, but using it well requires an in-depth understanding of how lossy compression affects model quality. Prior work largely considers a single application or compression method. We designed a systematic methodology for evaluating data reduction techniques for ML/AI and used it to perform a comprehensive evaluation of 17 data reduction methods on 7 ML/AI applications, showing that modern lossy compression methods can achieve a 50-100x compression ratio improvement for a 1% or less loss in quality. We identify critical insights that guide the future use and design of lossy compressors for ML/AI.
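The trade-off described here, compression ratio against loss in model quality, is a two-objective comparison, so candidate configurations are naturally compared by Pareto optimality. The sketch below is a brute-force filter over points that have already been evaluated; it is not the paper's sampling approach, which the summary describes only as efficient. The `Result` type, the configuration labels, and all numbers are hypothetical and chosen purely for illustration.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    label: str
    compression_ratio: float  # higher is better
    quality_loss: float       # lower is better (fraction of baseline quality lost)


def dominates(a: Result, b: Result) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    no_worse = (a.compression_ratio >= b.compression_ratio
                and a.quality_loss <= b.quality_loss)
    strictly_better = (a.compression_ratio > b.compression_ratio
                       or a.quality_loss < b.quality_loss)
    return no_worse and strictly_better


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep every result that no other result dominates, sorted by compression ratio."""
    front = [r for r in results if not any(dominates(o, r) for o in results)]
    return sorted(front, key=lambda r: r.compression_ratio)


# Illustrative values only; these are not measurements from the paper.
results = [
    Result("config-A", 5.0, 0.000),
    Result("config-B", 40.0, 0.004),
    Result("config-C", 30.0, 0.010),
    Result("config-D", 90.0, 0.012),
    Result("config-E", 60.0, 0.050),
]
front = pareto_front(results)
```

Here `config-C` and `config-E` drop out because another configuration compresses more with less quality loss. The filter costs a quadratic number of comparisons, which is fine for a handful of evaluated points; the expensive part in practice is training a model for each point, which is what a sampling strategy would try to limit.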

Paper Structure (19 sections, 4 figures, 6 tables, 1 algorithm)

Figures (4)

  • Figure 1: Workflow Overview
  • Figure 2: Tuning the Superconductor Dataset
  • Figure 3: Global Pareto Optimal Points: Compression Ratio and Quality for Various Applications. Methods are omitted when not Pareto optimal.
  • Figure 4: Distribution of the Value Ranges for Candle NT-3. A wide array of values can be observed.