Aggressive Post-Training Compression on Extremely Large Language Models

Zining Zhang; Yao Chen; Bingsheng He; Zhenjie Zhang

TL;DR

We propose a network pruning technique that combines sparsity above 0.7 with quantization below 8 bits, compressing prevailing LLMs within a couple of hours while keeping the accuracy loss relatively small.

Abstract

The increasing size and complexity of Large Language Models (LLMs) make them difficult to deploy on personal computers and mobile devices. Aggressive post-training compression is necessary to reduce model size, but it often causes significant accuracy loss. To address this challenge, we propose a novel network pruning technique that combines sparsity above 0.7 with quantization below 8 bits. Our approach compresses prevailing LLMs within a couple of hours while keeping the accuracy loss relatively small. In experimental evaluations, our method demonstrates its effectiveness and its potential for practical deployment. By making LLMs available on consumer devices, our work can facilitate a new era of natural language processing applications with wide-ranging impact.
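To give a rough sense of what these targets mean for storage, the following back-of-the-envelope sketch estimates weight size at a given sparsity and bit width. It is an illustration only, not the paper's method: the function name `compressed_size_gb` is hypothetical, the 176e9 parameter count is read off the model name BLOOM-176B, the 16-bit dense baseline is an assumption, and the overhead of sparse indices and quantization metadata is ignored.

```python
def compressed_size_gb(num_params: float, sparsity: float, bits: int) -> float:
    """Idealized weight storage in GB (1 GB = 10**9 bytes).

    Assumes only the non-pruned weights are stored, each at `bits` bits.
    Sparse-index and quantization-scale overhead is ignored (an assumption).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    if bits <= 0:
        raise ValueError("bits must be positive")
    kept_params = num_params * (1.0 - sparsity)  # weights that survive pruning
    return kept_params * bits / 8 / 1e9          # bits -> bytes -> GB


# Assumed 16-bit dense baseline for a 176e9-parameter model.
dense_fp16 = compressed_size_gb(176e9, sparsity=0.0, bits=16)
# The thresholds named in the abstract: 0.7 sparsity, 8-bit weights.
compressed = compressed_size_gb(176e9, sparsity=0.7, bits=8)

print(f"dense 16-bit: {dense_fp16:.1f} GB")
print(f"0.7 sparsity, 8-bit: {compressed:.1f} GB")
print(f"reduction: {dense_fp16 / compressed:.2f}x")
```

Under these idealized assumptions the weights shrink from about 352 GB to about 53 GB. Real savings would be smaller once sparse-format indices and quantization parameters are counted.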

Paper Structure (13 sections, 12 equations, 5 figures, 3 tables)

Introduction
Background and Related Work
  Post-training Model Compression
  Hessian-based Weight Update
  Layer-adaptive Sparsity
Methodology
  Sequentially-Pruning-All Assumption
  Sparsity Scheduler
Experiments
  Settings
  Sparsity vs. Perplexity
  Comparing with Naive Layer-order Sparsity Scheduler
Conclusion

Figures (5)

Figure 1: The exponential relationship between sparsity and perplexity in BLOOM-176B using SparseGPT.
Figure 2: Hessian-based prune vs. quantize. Sketched blocks represent the current processing column. Yellow blocks are pruned or quantized weights and blue blocks are those that haven't been compressed. Purple rows are the inverse of corresponding Hessian matrices.
Figure 3: The current row of $\mathbf H^c$ approximates all possible previous masks. Two different masks are shown at the top. Each mask contains the Hessian updates of previous columns, but these are already reflected in $\mathbf H_{j}^{-1, {1:j-1}}$.
Figure 4: Score $L_\ell$ plots for different OPT models. They all show an exponential distribution, but OPT-6.7B lacks a flat area at the low-loss region, i.e. the distribution is short-tailed.
Figure 5: Score-based ranking vs. sequential layer order. The comparison uses our proposed loss estimation. The layer-order result contains 6 different lines, one for each of the 6 linear modules: the Q, K, and V projections, the output projection, and the two feed-forward connections.
