DISP-LLM: Dimension-Independent Structural Pruning for Large Language Models

Shangqian Gao, Chi-Heng Lin, Ting Hua, Tang Zheng, Yilin Shen, Hongxia Jin, Yen-Chang Hsu

TL;DR

Experimental results demonstrate that the proposed dimension-independent structural pruning method outperforms other state-of-the-art methods, showing for the first time that structural pruning can achieve an accuracy similar to semi-structural pruning.

Abstract

Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, including language modeling, understanding, and generation. However, the growing memory and computational costs of these models pose significant challenges for deployment on resource-limited devices. Structural pruning has emerged as a promising way to reduce the costs of LLMs without requiring post-processing steps. Prior structural pruning methods either follow the dependence of structures, which limits flexibility, or introduce non-trivial additional parameters by incorporating different projection matrices. In this work, we propose a novel approach that relaxes the constraint imposed by regular structural pruning methods and eliminates the structural dependence along the embedding dimension. Our dimension-independent structural pruning method offers two benefits. First, it enables different blocks to use different subsets of the feature maps. Second, by removing structural dependence, it allows each block to have different widths along its input and output dimensions, which significantly increases the flexibility of structural pruning. We evaluate our method on various LLMs, including OPT, LLaMA, LLaMA-2, Phi-1.5, and Phi-2. Experimental results demonstrate that our approach outperforms other state-of-the-art methods, showing for the first time that structural pruning can achieve an accuracy similar to semi-structural pruning.

Paper Structure

This paper contains 26 sections, 2 theorems, 11 equations, 13 figures, 16 tables, 2 algorithms.

Key Result

Proposition 1

Let the pseudo-selection matrices in layers $l$ and $l+1$ be $\mathbf{S}_l$ and $\mathbf{S}_{l+1}$, respectively. The number of nonzero entries in the residual adapter satisfies For compression strategies that remove dependent structures for layer $l+1$ following $\mathbf{S}_l^\top \mathbf{S}_{l+1}$, this implies that the dimension in layer $l+1$ is less than or equal to that in layer $l$, with e
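The inequality in Proposition 1 is missing from this excerpt, and the sketch below does not try to reconstruct it. It only illustrates one fact that holds under a stated assumption: if the pseudo-selection matrices are diagonal 0/1 matrices (an assumption made here, as are the helper name `pseudo_selection` and the index lists), then $\mathbf{S}_l^\top \mathbf{S}_{l+1}$ is nonzero only at indices kept by both layers, so a strategy that follows this product cannot keep more dimensions in layer $l+1$ than in layer $l$.

```python
import numpy as np

def pseudo_selection(d, kept):
    """Hypothetical helper: d x d diagonal 0/1 matrix with ones at the kept indices."""
    s = np.zeros((d, d))
    s[kept, kept] = 1.0
    return s

d = 8
kept_l = [0, 1, 2, 3, 5, 6]   # indices kept in layer l (illustrative values)
kept_next = [1, 2, 5, 7]      # indices kept in layer l+1 (illustrative values)

S_l = pseudo_selection(d, kept_l)
S_next = pseudo_selection(d, kept_next)

# Product of the two pseudo-selection matrices: ones survive only
# where both layers keep the same index.
adapter = S_l.T @ S_next
nnz = int(np.count_nonzero(adapter))
shared = sorted(set(kept_l) & set(kept_next))

print(nnz, shared)
```

In this toy case layer $l+1$ keeps index 7, which layer $l$ does not, so that index drops out of the product. Breaking the dependence, as the method proposes, is what lets each layer keep its own index set.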

Figures (13)

  • Figure 1: We use an MLP layer as an example. Left: Regular pruning methods must follow structural dependence, so their flexibility is limited. Right: Our dimension-independent structural pruning method breaks the structural dependence via index operations and thus greatly improves pruning flexibility.
  • Figure 2: Our method, DISP-LLM, applies different selection matrices to the input and output dimensions of the Attention layer and MLP layer ($\mathbf{S}_1/\mathbf{S}_2$: Attention in/out; $\mathbf{S}_3/\mathbf{S}_4/\mathbf{S}_5$: MLP in/middle/out). When pruning the model, we add "Index Selection" before Layer Norm and replace addition with "Index Add." $\hat{\mathbf{S}}_1, \ldots, \hat{\mathbf{S}}_5$ are applied for pruning weight matrices.
  • Figure 3: Comparison of the projection matrices for structural pruning. We use $\mathbf{W}_{\text{in}}$ and $\mathbf{W}_{\text{out}}$ in Fig. 1 as an example. Left: SliceGPT employs orthogonal projection matrices, and it has to insert the projection matrices into the residual connections. Middle: Regular structural pruning methods remove structures based on their dependence, which requires a single unified selection matrix $\mathbf{S}$ for all blocks and limits flexibility. Right: Our method breaks the structural dependence, allowing different selection matrices $\mathbf{S}_{\text{in}}$ and $\mathbf{S}_{\text{out}}$ for the embedding dimension, which significantly improves the flexibility of pruning.
  • Figure 4: The pruned model architecture along the embedding dimension (model dimension) for the LLaMA-2 7B model when the pruning ratio equals 50%.
  • Figure 5: Training dynamics when learning the hypernetwork (four panels), results of different settings (two panels), throughput (one panel), and cost (one panel).
  • ...and 8 more figures
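The "Index Selection" and "Index Add" operations named in the Figure 2 caption can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy ReLU MLP, the parameter-free RMS normalization, and all shapes and index values are assumptions made for the example.

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    """Parameter-free RMS normalization, a stand-in for the block's Layer Norm."""
    return h / np.sqrt(np.mean(h * h, axis=-1, keepdims=True) + eps)

def pruned_mlp_block(x, idx_in, idx_out, w_in, w_out):
    """Residual MLP block whose input and output use different feature subsets.

    x       : (tokens, d) full-width residual stream
    idx_in  : unique indices of the features the block reads
    idx_out : unique indices of the features the block writes
    w_in    : (len(idx_in), hidden) pruned input projection
    w_out   : (hidden, len(idx_out)) pruned output projection
    """
    h = rms_norm(x[:, idx_in])              # index selection, then normalization
    h = np.maximum(h @ w_in, 0.0) @ w_out   # toy ReLU MLP (assumption)
    y = x.copy()
    y[:, idx_out] += h                      # index add replaces the plain residual add
    return y

rng = np.random.default_rng(0)
d, hidden, tokens = 8, 4, 3
idx_in = np.array([0, 2, 5])        # block reads 3 of 8 features
idx_out = np.array([1, 2, 6, 7])    # block writes 4 of 8 features
w_in = rng.standard_normal((idx_in.size, hidden))
w_out = rng.standard_normal((hidden, idx_out.size))
x = rng.standard_normal((tokens, d))

y = pruned_mlp_block(x, idx_in, idx_out, w_in, w_out)
print(y.shape)
```

The residual stream keeps its full width, while the block's weight matrices only have as many rows and columns as the selected indices. Because `idx_in` and `idx_out` are chosen independently, the input and output widths of the block can differ.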

Theorems & Definitions (3)

  • Proposition 1: Decreasing feature dimensions for deeper layers
  • Proposition 1: Decreasing feature dimensions for deeper layers
  • proof