Morphing-based Compression for Data-centric ML Pipelines

Sebastian Baunsgaard, Matthias Boehm

TL;DR

BWARE addresses the data redundancy inherent in data-centric ML pipelines by extending workload-aware, lossless compression through feature transformations and introducing morphing to tune compressed representations without decompression. It integrates a frame- and matrix-encoding stack, compressed I/O, and a compiler-runtime workflow to dynamically insert compression and morphing into linear-algebra programs. The approach demonstrates substantial end-to-end speedups across a range of datasets and tasks, driven by reusing compressed intermediates, reduced I/O, and workload-aware re-encoding. This work highlights the practical impact of pushing compression through pre-processing and feature engineering to improve data locality and scalability in data-centric ML pipelines, with promising directions for broader transformations and hardware acceleration.

Abstract

Data-centric ML pipelines extend traditional machine learning (ML) pipelines -- of feature transformations and ML model training -- by outer loops for data cleaning, augmentation, and feature engineering to create high-quality input data. Existing lossless matrix compression applies lightweight compression schemes to numeric matrices and performs linear algebra operations such as matrix-vector multiplications directly on the compressed representation but struggles to efficiently rediscover structural data redundancy. Compressed operations are effective at fitting data in available memory, reducing I/O across the storage-memory-cache hierarchy, and improving instruction parallelism. The applied data cleaning, augmentation, and feature transformations provide a rich source of information about data characteristics such as distinct items, column sparsity, and column correlations. In this paper, we introduce BWARE -- an extension of AWARE for workload-aware lossless matrix compression -- that pushes compression through feature transformations and engineering to leverage information about structural transformations. Besides compressed feature transformations, we introduce a novel technique for lightweight morphing of a compressed representation into workload-optimized compressed representations without decompression. BWARE shows substantial end-to-end runtime improvements, reducing the execution time for training data-centric ML pipelines from days to hours.
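
The two ideas in the abstract, operating directly on a compressed representation and pushing compression through a feature transformation, can be illustrated with a small sketch. It is not the BWARE or AWARE implementation: the `DictColumn` class and the `compress`, `one_hot`, `matvec`, and `decompress` helpers are hypothetical names introduced here, and a plain dictionary encoding stands in for the paper's column-group encodings. The sketch assumes a single categorical column stored as a dictionary of distinct values plus one code per row.

```python
from dataclasses import dataclass


@dataclass
class DictColumn:
    """Hypothetical dictionary-encoded column group (illustration only)."""
    dictionary: list  # one tuple per distinct value; tuple width = number of columns
    codes: list       # per-row index into `dictionary`


def compress(column):
    """Dictionary-encode a single column of hashable values."""
    index = {}
    codes = []
    for value in column:
        # Assign the next free code the first time a value is seen.
        codes.append(index.setdefault(value, len(index)))
    # Python dicts keep insertion order, so position i holds the value with code i.
    dictionary = [(value,) for value in index]
    return DictColumn(dictionary, codes)


def one_hot(col):
    """One-hot encode without touching the rows: only the dictionary changes.

    Each distinct value maps to one row of an identity matrix; the per-row
    codes are shared with the input instead of being copied or decompressed.
    """
    d = len(col.dictionary)
    identity = [tuple(1.0 if i == j else 0.0 for j in range(d)) for i in range(d)]
    return DictColumn(identity, col.codes)


def matvec(col, v):
    """Compute X @ v for the virtual decompressed group X.

    The dictionary is multiplied once per distinct value, then the results
    are gathered per row through the codes.
    """
    pre = [sum(a * b for a, b in zip(row, v)) for row in col.dictionary]
    return [pre[c] for c in col.codes]


def decompress(col):
    """Materialize the dense rows (used here only to check the result)."""
    return [col.dictionary[c] for c in col.codes]


column = ["a", "b", "a", "c", "b", "a"]
encoded = one_hot(compress(column))
weights = [1.0, 2.0, 3.0]
result = matvec(encoded, weights)
dense = [sum(a * b for a, b in zip(row, weights)) for row in decompress(encoded)]
assert result == dense
```

In this sketch the one-hot output costs a d-by-d dictionary regardless of the number of rows, and the matrix-vector product does arithmetic once per distinct value rather than once per row. This is the kind of structural information (distinct items, sparsity) that the abstract says transformations can expose to the compressor.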

Paper Structure

This paper contains 28 sections, 30 figures, 4 tables, and 2 algorithms.

Figures (30)

  • Figure 1: BWARE Framework Overview and Contributions.
  • Figure 2: Relative Number of Distinct Values in ML Datasets. Columns Sorted by the Number of Distinct Values.
  • Figure 3: Lossy Quantization Effect on Values.
  • Figure 4: Output Memory Sizes of One-Hot/Dummy Coding an Input.
  • Figure 5: Relative $d$ Increase when Co-coding Features in Adult: Original and One-Hot Encoded Features.
  • ...and 25 more figures