When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models

Weilan Wang, Yu Mao, Dongdong Tang, Hongchao Du, Nan Guan, Chun Jason Xue

TL;DR

This work tackles the memory bottleneck of deploying billion-parameter LLMs on memory-limited devices with a double compression framework: compression-aware quantization and pruning, followed by a lossless weight compressor, plus a speed-adaptive decompression strategy at runtime. Evaluated across multiple models and tasks, the approach achieves about a 2.2x reduction in compressed model size with roughly 1% accuracy loss and about 40% memory savings during inference, while mitigating decompression overhead. Key innovations include per-channel activation-guided weight scaling to enhance weight compressibility, pruning using activation-based $\ell_\infty$ norms, and a lossless compression layer based on asymmetric numeral systems (ANS), combined with a partial-compression technique that balances memory and speed. The results suggest a practical path to deploying large language models on devices with constrained memory by substantially reducing the memory footprint at a small cost in accuracy.

Abstract

Large language models (LLMs) exhibit excellent performance in various tasks. However, their memory requirements present a great challenge when deploying on memory-limited devices, even for quantized LLMs. This paper introduces a framework that further compresses LLMs after quantization, achieving a compression ratio of about 2.2x. A compression-aware quantization is first proposed to enhance weight compressibility by re-scaling the model parameters before quantization, followed by a pruning method that improves compressibility further. Building on this, we observe that decompression can become a bottleneck in practical scenarios. We therefore analyze in detail the trade-off between memory usage and latency introduced by the proposed method, and propose a speed-adaptive method to overcome this bottleneck. Experimental results show that inference with the compressed model achieves a 40% reduction in memory size with negligible loss in accuracy and inference speed.
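
The idea that pruning makes quantized weights easier to compress losslessly can be illustrated with a toy sketch. Everything here is an assumption made for illustration rather than the paper's implementation: the weights and activation maxima are synthetic, a flat vector stands in for one row of a weight matrix, the pruning score (|w| times the channel's activation maximum) is a guess at the kind of score described, `zlib` stands in for the ANS coder, and the re-scaling step is left out because the abstract does not give its formula.

```python
import random
import zlib

def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization: returns (ints in [-127, 127], scale)."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    return [max(-127, min(127, round(w / scale))) for w in weights], scale

def prune_by_score(q_weights, act_max, sparsity):
    """Zero the `sparsity` fraction of weights with the lowest |w| * act_max score."""
    scores = [abs(q) * a for q, a in zip(q_weights, act_max)]
    k = int(len(q_weights) * sparsity)
    drop = set(sorted(range(len(scores)), key=scores.__getitem__)[:k])
    return [0 if i in drop else q for i, q in enumerate(q_weights)]

def compressed_size(q_weights):
    """Size in bytes after lossless compression (zlib stands in for ANS)."""
    raw = bytes((q + 256) % 256 for q in q_weights)  # signed int8 -> unsigned byte
    return len(zlib.compress(raw, 9))

rng = random.Random(0)
n = 4096
weights = [rng.gauss(0.0, 0.02) for _ in range(n)]            # synthetic weights
act_max = [abs(rng.gauss(0.0, 1.0)) + 0.1 for _ in range(n)]  # hypothetical per-channel max |activation|

q, scale = quantize_int8(weights)
q_pruned = prune_by_score(q, act_max, sparsity=0.3)

size_q = compressed_size(q)
size_pruned = compressed_size(q_pruned)
print(f"INT8 raw: {n} B, compressed: {size_q} B, pruned + compressed: {size_pruned} B")
```

Pruning concentrates probability mass on the value 0, which lowers the entropy of the byte stream; any entropy coder, ANS included, can exploit that, which is why the pruned stream should come out smaller here.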

Paper Structure

This paper contains 16 sections, 6 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Activation, weight, and scaled-weight data distributions of the OPT-1.3B model. The left panel shows the data distribution for every layer before quantization; points above the upper edge lines are outliers. The right panel shows the data distribution after quantization.
  • Figure 2: Overview of Double Compression. The LLM weights are scaled, quantized, pruned, and compressed. The inference throughput is analyzed for adaptive compression of model weights.
  • Figure 3: The double compression method first scales the weights using the per-channel activation maximum value. Then, INT8 quantization is applied to compress the weights, followed by score-based pruning.
  • Figure 4: System architectures for loading an LLM. (a) The universal LLM loading method. (b) Compressed model size $<$ GPU memory capacity. There is a decompression buffer to store the decompression results. (c) GPU memory capacity $<$ Compressed model size $<$ GPU+CPU memory capacity. (d) GPU+CPU memory capacity $<$ Compressed model size. The decompression buffer is allocated in GPU memory to store the uncompressed data.
  • Figure 5: Speed Adaptive Compression. The LLM data is divided into data blocks, each containing several chunks. In every data block, only the last chunk is selected for compression. Decompression speed can be increased by a factor of $Block_{size}/Chunk_{size}$ at the cost of a lower compression ratio.
  • ...and 5 more figures
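
The partial-compression scheme described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: `partial_compress` and `partial_decompress` are hypothetical names, `zlib` stands in for the ANS coder, the data is a toy byte pattern, and the speedup estimate assumes decompression time is proportional to the number of bytes that must be decompressed.

```python
import zlib

def partial_compress(data: bytes, block_size: int, chunk_size: int):
    """Split data into blocks and compress only the last chunk of each block."""
    assert 0 < chunk_size <= block_size and block_size % chunk_size == 0
    blocks = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        split = max(len(block) - chunk_size, 0)  # everything before `split` stays raw
        blocks.append((block[:split], zlib.compress(block[split:])))
    return blocks

def partial_decompress(blocks) -> bytes:
    """Rebuild the original bytes: raw prefix plus decompressed last chunk, per block."""
    return b"".join(raw + zlib.decompress(packed) for raw, packed in blocks)

data = bytes([0, 0, 0, 1, 0, 2, 0, 0] * 512)  # 4096 bytes of compressible toy "weights"
block_size, chunk_size = 1024, 256

blocks = partial_compress(data, block_size, chunk_size)
restored = partial_decompress(blocks)

# Only chunk_size bytes per block go through the decompressor.
bytes_to_decompress = sum(len(zlib.decompress(packed)) for _, packed in blocks)
speedup = len(data) / bytes_to_decompress
stored = sum(len(raw) + len(packed) for raw, packed in blocks)
print(f"speedup ~ {speedup:.1f}x, stored {stored} of {len(data)} bytes")
```

With these sizes only a quarter of the data passes through the decompressor, matching the $Block_{size}/Chunk_{size}$ factor in the caption, while three quarters of each block is stored uncompressed, which is the compression-ratio cost the caption mentions.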