Lossless and Near-Lossless Compression for Foundation Models

Moshik Hershcovitch, Leshem Choshen, Andrew Wood, Ilias Enmouri, Peter Chin, Swaminathan Sundararaman, Danny Harnik

TL;DR

This work investigates lossless compression of foundation-model weights with a decompression path back to the original size, showing meaningful network and storage reductions and occasional >50% savings. It introduces Byte Grouping to align bytes by position across parameters and a tunable lossy variant that preserves accuracy while substantially increasing compression, plus delta compression for related models. The authors implement PyTorch integrations and validate with model-hub transfers, demonstrating real-world end-to-end timing gains, especially on highly compressible models, and quantify gains in gradients/optimizers as well. Collectively, the results argue for making lossless compression a default in model hubs and highlight practical approaches to further shrink traffic and storage in large-scale model ecosystems.

Abstract

As models grow in size and in scale of deployment, their sheer size burdens the infrastructure, requiring more network bandwidth and more storage to accommodate them. While there is a vast literature on reducing model sizes, we investigate a more traditional type of compression, one that compresses the model to a smaller form and is coupled with a decompression algorithm that returns it to its original size: lossless compression. Somewhat surprisingly, we show that such lossless compression can yield significant network and storage reductions on popular models, at times reducing the model size by over $50\%$. We investigate the source of model compressibility, introduce compression variants tailored for models, and categorize models into compressibility groups. We also introduce a tunable lossy compression technique that can further reduce size, even on the less compressible models, with little to no effect on model accuracy. We estimate that these methods could save over an exabyte per month of network traffic downloaded from a large model hub like HuggingFace.
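The Byte Grouping idea named in the TL;DR (aligning bytes by position across parameters) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the helper names `byte_group` and `byte_ungroup` are hypothetical, `zlib` stands in for whichever compressor the paper uses, and the "weights" are synthetic Gaussian values rather than a real model.

```python
import random
import struct
import zlib


def byte_group(raw: bytes, width: int = 4) -> list[bytes]:
    """Split a buffer of fixed-width parameters into `width` arrays.

    Array i holds byte i of every parameter (hypothetical helper).
    """
    if len(raw) % width:
        raise ValueError("buffer length must be a multiple of the parameter width")
    return [raw[i::width] for i in range(width)]


def byte_ungroup(groups: list[bytes]) -> bytes:
    """Inverse of byte_group: interleave the arrays back into one buffer."""
    width = len(groups)
    out = bytearray(sum(len(g) for g in groups))
    for i, group in enumerate(groups):
        out[i::width] = group  # put byte i of every parameter back in place
    return bytes(out)


# Synthetic stand-in for FP32 weights (assumption: small, zero-centred values).
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(50_000)]
raw = struct.pack(f"<{len(weights)}f", *weights)

groups = byte_group(raw)
plain_size = len(zlib.compress(raw))
grouped_size = sum(len(zlib.compress(g)) for g in groups)

# Lossless: regrouping must reproduce the original buffer exactly.
assert byte_ungroup(groups) == raw

print(f"original bytes:        {len(raw)}")
print(f"zlib, no grouping:     {plain_size}")
print(f"zlib, byte grouping:   {grouped_size}")
```

The design choice is that each of the four arrays is compressed independently, so bytes with similar statistics (for example, the byte carrying most of the exponent) sit next to each other instead of being interleaved with near-random mantissa bytes. How much this helps depends on the compressor and on the model.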

Paper Structure (32 sections, 12 figures, 3 tables)

Figures (12)

  • Figure 1: FP32: a sign bit, 8 exponent bits, and 23 mantissa bits. BF16: a sign bit, 8 exponent bits, and 7 mantissa bits. FP16: a sign bit, 5 exponent bits, and 10 mantissa bits.
  • Figure 2: An example of Byte Grouping: each parameter has 4 bytes, and we group the bytes by position into 4 arrays.
  • Figure 3: Fine-tuned RoBERTa compression and accuracy as a function of the precision factor parameter $b$ (i.e., for $b=27$ the factor is $B=2^{27}$). The first two values are lossless compression without and with byte grouping.
  • Figure 4: Download and upload times of 3 models using full model compression vs. the non-compressed version.
  • Figure 5: Breakdown of download time for the wav2vec model on a 30 MBps network.
  • ...and 7 more figures
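
The bit layouts listed in the Figure 1 caption can be checked with a short sketch that splits a value into its sign, exponent, and mantissa fields. The names `LAYOUTS`, `to_bits`, and `split_fields` are hypothetical, and the BF16 path is a simplification: it truncates the FP32 encoding to its upper 16 bits, whereas real conversions usually round.

```python
import struct

# (exponent bits, mantissa bits) per format, as given in the Figure 1 caption.
LAYOUTS = {"FP32": (8, 23), "BF16": (8, 7), "FP16": (5, 10)}


def to_bits(value: float, fmt: str) -> int:
    """Return the raw bit pattern of `value` in the given format."""
    if fmt == "FP32":
        return struct.unpack("<I", struct.pack("<f", value))[0]
    if fmt == "BF16":
        # Simplifying assumption: keep the upper 16 bits of FP32 (truncation).
        return struct.unpack("<I", struct.pack("<f", value))[0] >> 16
    if fmt == "FP16":
        return struct.unpack("<H", struct.pack("<e", value))[0]
    raise ValueError(f"unknown format: {fmt}")


def split_fields(value: float, fmt: str) -> tuple[int, int, int]:
    """Split a value into (sign, biased exponent, mantissa) bit fields."""
    exp_bits, man_bits = LAYOUTS[fmt]
    bits = to_bits(value, fmt)
    sign = bits >> (exp_bits + man_bits)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    return sign, exponent, mantissa


for fmt in LAYOUTS:
    print(fmt, split_fields(-0.5, fmt))
```

Because the exponent occupies the high-order bits in every format, values of similar magnitude share most of their leading byte, which is one plausible reason that grouping bytes by position helps a general-purpose compressor.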