NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks

Yongchang Hao, Yanshuai Cao, Lili Mou

TL;DR

This work introduces NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks that can reduce memory usage by more than half while maintaining near-lossless performance in inference.

Abstract

The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
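
To make the phrase "entropy of floating-point numbers" concrete, the following is a minimal sketch, not NeuZip's implementation. It assumes synthetic Gaussian weights (standard deviation 0.02, chosen arbitrarily) in place of real Llama-3 parameters, and the helper names `float32_fields` and `shannon_entropy` are illustrative. It splits each IEEE 754 binary32 value into sign, exponent, and mantissa bits and measures the empirical Shannon entropy of each component. A component whose entropy is well below its bit width is, in principle, a candidate for lossless entropy coding.

```python
import math
import random
import struct
from collections import Counter


def float32_fields(x):
    """Split a float into IEEE 754 binary32 sign, exponent, and mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def shannon_entropy(symbols):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Assumption: synthetic weights stand in for a real model's parameters.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(50_000)]

signs, exponents, mantissas = zip(*(float32_fields(w) for w in weights))

sign_entropy = shannon_entropy(signs)          # stored in 1 bit
exponent_entropy = shannon_entropy(exponents)  # stored in 8 bits
# Keep only the top 7 mantissa bits, the width that bfloat16 retains.
top7_mantissa_entropy = shannon_entropy([m >> 16 for m in mantissas])

print(f"sign:     {sign_entropy:.2f} bits of 1")
print(f"exponent: {exponent_entropy:.2f} bits of 8")
print(f"mantissa: {top7_mantissa_entropy:.2f} bits of 7")
```

On data like this, the exponent typically carries far fewer than its 8 stored bits of information, while the sign and mantissa are close to incompressible. Whether the same holds for a given trained model has to be measured on its actual weights.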

Paper Structure

This paper contains 36 sections, 7 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Histograms of the different components of the parameters of the Llama-3 8B model (Dubey et al., 2024). The $x$-axis covers all possible binary values, and the $y$-axis represents the frequency of each value.
  • Figure 2: Reverse-mode automatic differentiation (e.g., back-propagation) with different memory-saving techniques for a linear layer. Blocks colored blue are loaded in memory temporarily for the calculation of this layer, whereas the blocks colored red are always in memory throughout training.
  • Figure 3: The trade-off between memory and performance for different methods.
  • Figure 4: The throughput experiment. (a) Comparison of CPU-offloading, quantization, lossy NeuZip compression, and lossless NeuZip compression. (b) Comparison of GPU-reloading, de-quantization, lossy NeuZip decompression, and lossless NeuZip decompression.
  • Figure 5: The histograms of different floating-point components of the parameters of a randomly initialized Llama-3 8B model.
  • ...and 4 more figures