Understanding The Effectiveness of Lossy Compression in Machine Learning Training Sets

Robert Underwood; Jon C. Calhoun; Sheng Di; Franck Cappello

TL;DR

This work addresses the data-volume bottleneck in HPC-enabled ML/AI by developing a systematic methodology to evaluate lossy data reduction across 7 ML/AI applications using 17 data-reduction techniques. The study demonstrates that modern error-bounded lossy compression (EBLC) methods can achieve $50$-$100\times$ compression with around $1\%$ quality loss, identifies per-column value-range relative error bounds as particularly effective for tabular data, and introduces an efficient Pareto-point sampling approach to navigate the multi-objective trade-offs. It analyzes the behavior of lossless, traditional lossy, and error-bounded lossy compressors, providing actionable guidance for practitioners and compressor designers, including how to leverage IO-accelerated transfers and parallel decompression. The findings have practical impact on data sharing, reproducibility, and cost reduction in HPC-based ML/AI pipelines, and they outline directions for targeted compressor design improvements and automated evaluation workflows.
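To make the per-column value-range relative error bound concrete, the sketch below converts a single relative bound into one absolute bound per column, using the usual definition (relative bound times the column's max minus min). The function names, the random table, and the uniform quantizer standing in for an error-bounded compressor are all assumptions for illustration; none of them is the paper's code or any specific compressor's API.

```python
import numpy as np

def per_column_abs_bounds(table, rel_bound):
    """Turn one value-range relative bound into an absolute bound per column."""
    value_range = table.max(axis=0) - table.min(axis=0)
    return rel_bound * value_range

def quantize(table, abs_bounds):
    """Hypothetical stand-in for an error-bounded compressor.

    Uniform quantization with bin width 2 * bound keeps the pointwise
    error within the bound (up to floating-point rounding).
    """
    step = 2.0 * abs_bounds
    safe_step = np.where(step > 0, step, 1.0)  # avoid dividing by zero on constant columns
    quantized = np.round(table / safe_step) * safe_step
    return np.where(step > 0, quantized, table)  # constant columns pass through unchanged

# Made-up tabular data: two columns with very different value ranges.
rng = np.random.default_rng(0)
table = np.column_stack([
    rng.uniform(0.0, 1.0, 1000),
    rng.uniform(-5000.0, 5000.0, 1000),
])

abs_bounds = per_column_abs_bounds(table, rel_bound=1e-3)
recon = quantize(table, abs_bounds)
max_err = np.abs(recon - table).max(axis=0)
```

With a per-column bound, the narrow column gets a tight absolute tolerance and the wide column a loose one; a single absolute bound shared by the whole table would have to be either too strict for the wide column or too coarse for the narrow one.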

Abstract

Machine Learning and Artificial Intelligence (ML/AI) techniques have become increasingly prevalent in high performance computing (HPC). However, these methods depend on vast volumes of floating-point data for training and validation, and that data must be shared over a wide area network (WAN) or transferred from edge devices to data centers. Data compression can address these problems, but an in-depth understanding of how lossy compression affects model quality is needed. Prior work largely considers a single application or compression method. We designed a systematic methodology for evaluating data reduction techniques for ML/AI, and we use it to perform a comprehensive evaluation of 17 data reduction methods on 7 ML/AI applications, showing that modern lossy compression methods can achieve a 50-100x compression ratio improvement for a 1% or less loss in quality. We identify critical insights that guide the future use and design of lossy compressors for ML/AI.

Paper Structure (19 sections, 4 figures, 6 tables, 1 algorithm)

Introduction
Applications
Data Reduction Techniques Studied
Dimensionality Reduction and Numerosity Reduction
State-of-the-art Data Compression Techniques
Lossless Compression Methods
Traditional Lossy Compression Methods
Error Bounded Lossy Methods
Problem Formalization and Methodology
Experimental Results
Evaluating the Effect on Application Quality and Insights for Compression Development
Plotting Candidate Pareto Points
Global Pareto Optimal Points
What insights does this provide for ML/AI practitioners and compressor designers
Performance Evaluation
...and 4 more sections

Figures (4)

Figure 1: Workflow Overview
Figure 2: Tuning the Superconductor Dataset
Figure 3: Global Pareto-optimal points: compression ratio and quality for various applications. Methods are omitted when not Pareto optimal.
Figure 4: Distribution of the value ranges for Candle NT-3. A wide array of values can be observed.
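
Figure 3 reports only Pareto-optimal points. As a minimal sketch of what that filtering means, the snippet below keeps the (compression ratio, quality) pairs that no other pair beats on both axes. The function name and the candidate values are invented for illustration, and this is plain dominance filtering, not the paper's Pareto-point sampling procedure.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (compression_ratio, quality); higher is better on both axes.
    """
    front = []
    best_quality = float("-inf")
    # Visit the highest ratios first; a point survives only if its quality
    # beats every point with an equal or higher compression ratio.
    for ratio, quality in sorted(points, key=lambda p: (-p[0], -p[1])):
        if quality > best_quality:
            front.append((ratio, quality))
            best_quality = quality
    return front

# Hypothetical (compression ratio, quality) pairs, not results from the paper.
candidates = [
    (2.0, 0.95),
    (10.0, 0.94),
    (50.0, 0.93),
    (50.0, 0.90),
    (100.0, 0.80),
    (20.0, 0.85),
]

front = pareto_front(candidates)
```

Here `(50.0, 0.90)` and `(20.0, 0.85)` drop out because `(50.0, 0.93)` is at least as good on both axes; the remaining points each trade compression ratio against quality.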
