SparseLLM: Towards Global Pruning for Pre-trained Language Models

Guangji Bai; Yijiang Li; Chen Ling; Kibaek Kim; Liang Zhao

TL;DR

SparseLLM is a novel framework that recasts global pruning as a set of manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality.

Abstract

The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.
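For intuition, the decomposition idea described in the abstract can be sketched on a toy two-layer ReLU network. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the penalty weights `alpha` and `beta`, the fixed per-row magnitude masks, and the tiny random model are all hypothetical choices made for the example. Auxiliary variables `z` (pre-activations) and `a` (activations) are tied to the layers through soft penalties, so each block of variables can be updated in closed form while the others are held fixed.

```python
import numpy as np

def row_topk_mask(W, sparsity):
    """Boolean mask keeping the largest-magnitude entries of each row."""
    keep = max(1, int(round(W.shape[1] * (1.0 - sparsity))))
    idx = np.argsort(-np.abs(W), axis=1)[:, :keep]
    mask = np.zeros(W.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask

def masked_lstsq(mask, inputs, targets):
    """Row-wise least squares: targets ~= W @ inputs, with W zero outside mask."""
    W = np.zeros(mask.shape)
    for i in range(mask.shape[0]):
        support = mask[i]
        W[i, support] = np.linalg.lstsq(inputs[support].T, targets[i], rcond=None)[0]
    return W

def objective(W1, W2, z, a, X, Y, alpha, beta):
    """Output error plus soft-constraint penalties (assumed form for this sketch)."""
    return (np.sum((Y - W2 @ a) ** 2)
            + alpha * np.sum((z - W1 @ X) ** 2)
            + beta * np.sum((a - np.maximum(z, 0.0)) ** 2))

def coordinated_prune(W1_dense, W2_dense, X, sparsity=0.5, alpha=1.0, beta=1.0, iters=10):
    """Alternate closed-form updates of W1, W2, a, z with fixed masks (toy sketch)."""
    Y = W2_dense @ np.maximum(W1_dense @ X, 0.0)   # dense outputs to preserve
    M1 = row_topk_mask(W1_dense, sparsity)
    M2 = row_topk_mask(W2_dense, sparsity)
    W1, W2 = W1_dense * M1, W2_dense * M2          # start from magnitude pruning
    z = W1_dense @ X                               # auxiliary pre-activations
    a = np.maximum(z, 0.0)                         # auxiliary activations
    history = [objective(W1, W2, z, a, X, Y, alpha, beta)]
    for _ in range(iters):
        # Layer subproblems: each is an independent masked least-squares fit.
        W1 = masked_lstsq(M1, X, z)
        W2 = masked_lstsq(M2, a, Y)
        # Activation update: ridge-like linear system, solved exactly.
        hidden = W2.shape[1]
        a = np.linalg.solve(W2.T @ W2 + beta * np.eye(hidden),
                            W2.T @ Y + beta * np.maximum(z, 0.0))
        # Pre-activation update: compare the best z >= 0 and the best z <= 0.
        p = W1 @ X
        z_pos = np.maximum((alpha * p + beta * a) / (alpha + beta), 0.0)
        z_neg = np.minimum(p, 0.0)
        cost_pos = alpha * (z_pos - p) ** 2 + beta * (a - z_pos) ** 2
        cost_neg = alpha * (z_neg - p) ** 2 + beta * a ** 2
        z = np.where(cost_pos <= cost_neg, z_pos, z_neg)
        history.append(objective(W1, W2, z, a, X, Y, alpha, beta))
    return W1, W2, M1, M2, history
```

Because every step exactly minimizes the penalized objective over one block of variables, the recorded objective should not increase from one iteration to the next. The sketch keeps the sparsity masks fixed and uses plain least squares for each layer; it makes no claim about how the pruned network compares with other pruning methods.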

Paper Structure (27 sections, 13 equations, 6 figures, 10 tables, 1 algorithm)

Introduction
Related work
Background and notation
Global pruning
Local pruning
Mask selection & weight reconstruction
Existing solvers
What is wrong with local pruning?
SparseLLM: Towards global pruning for LLMs
Motivation
A unified formulation of pruning
Algorithm design
SparseLLM on OPT models
SparseLLM on LLaMA models
Pruning of MHAs
...and 12 more sections

Figures (6)

Figure 1: SparseLLM decomposes the global pruning of LLMs into manageable subproblems by leveraging the chain of modules and auxiliary variables while maintaining dependencies.
Figure 2: Illustration of SparseLLM on OPT and LLaMA. The auxiliary variables and soft constraints (i.e., $\approx$) allow SparseLLM to decompose the global pruning into manageable subproblems while maintaining the dependencies. Subproblems are analytically solvable and enjoy fast convergence.
Figure 3: Fast convergence of SparseLLM. Training loss per epoch for pruning layer 3 of OPT-125m at 80% sparsity (left) and layer 6 of LLaMA-2 13b at 70% sparsity (right).
Figure 4: Illustration of the SparseLLM pruning method compared with conventional global pruning and local pruning. We consider a two-layer neural network as an abstraction for simplicity. Global pruning (left) is memory-prohibitive due to poor scalability. Local pruning (middle) prunes each layer independently and inevitably sacrifices performance because it lacks global supervision. Our adaptive global pruning (right) achieves global pruning with low memory cost by leveraging auxiliary variables and soft constraints.
Figure 5: Sensitivity of OPT-2.7b on the calibration sample sizes for datasets PTB and C4.
...and 1 more figure

Theorems & Definitions (2)

Remark 4.1: Generality and flexibility of Eq. \ref{eq: prune general formulation}
Remark 4.2: Global convergence of SparseLLM
