A Comprehensive Survey of Compression Algorithms for Language Models

Seungcheol Park; Jaehyeon Choi; Sojin Lee; U Kang

TL;DR

The paper addresses the growing need to compress large pretrained language models (PLMs) without sacrificing accuracy. It surveys a broad spectrum of techniques: pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. It emphasizes low-cost compression methods applicable to large language models (LLMs), selects three representative algorithms (SparseGPT, OPTQ, LoRA) for in-depth analysis, and discusses two key design properties and future research topics. The survey provides a structured taxonomy, practical trade-offs, and guidance for combining techniques to achieve high compression rates with minimal accuracy loss, helping researchers and practitioners navigate the landscape of PLM compression and deploy large language models in a scalable, energy-efficient way.

Abstract

How can we compress language models without sacrificing accuracy? The number of compression algorithms for language models is growing rapidly, driven by the desire to benefit from the remarkable advances of recent language models while avoiding the side effects of their gigantic size, such as increased carbon emissions and expensive maintenance fees. Although numerous compression algorithms have made remarkable progress, their sheer number ironically makes it challenging to capture emerging trends and identify the fundamental concepts underlying them. In this paper, we survey and summarize diverse compression algorithms, including pruning, quantization, knowledge distillation, low-rank approximation, parameter sharing, and efficient architecture design. We not only summarize the overall trends of these algorithms but also select representative algorithms and analyze them in depth. We discuss the value of each category of compression algorithms and the desired properties of low-cost compression algorithms, which have a significant impact due to the emergence of large language models. Finally, we introduce promising future research topics based on our survey results.

Paper Structure (40 sections, 21 equations, 10 figures, 7 tables)

Introduction
Preliminaries
Pretrained Language Model Compression Problem
Transformer Architecture
Pretrained Language Model Compression Algorithms
Backgrounds
Taylor Expansion
Lagrangian
Fisher Information Matrix
Straight-Through Estimator (STE)
Singular Value Decomposition (SVD)
Pruning
Overview
Pruning Granularity: Unstructured vs. Structured
Pruning Strategies: High-cost vs. Low-cost
...and 25 more sections

Figures (10)

Figure 1: Two variants of Transformer architecture.
Figure 2: Illustrations of multi-head attention (MHA), feed-forward network (FFN), and layer normalization.
Figure 3: Illustration of how Straight-Through Estimator works.
Figure 4: An example of pruning of a neural network with two layers using unstructured pruning or structured pruning.
Figure 5: Illustration of diverse pruning granularities regarding weights (a) and tokens (b). Dotted square boxes indicate each pruning granularity and we color the weights for the granularity of the embedding dimension in pink for simplicity. $d_h$ represents the dimension of token embeddings in attention heads.
...and 5 more figures
