FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models

Yang Zhang; Yawei Li; Xinpeng Wang; Qianli Shen; Barbara Plank; Bernd Bischl; Mina Rezaei; Kenji Kawaguchi

TL;DR

FinerCut is a new form of fine-grained layer pruning that, in contrast to prior work at the transformer block level, treats every self-attention and feed-forward network layer within a block as an individual pruning candidate.

Abstract

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters, which makes large-scale compute a necessity and raises environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning that, in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes the layers whose removal causes minimal alteration to the model's output, yielding a lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% of the performance of Llama3-8B with 25% of layers removed, and 95% of the performance of Llama3-70B with 30% of layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance, again without any fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, making it possible to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often in consecutive deeper decoder layers. We hope our insights inspire future efficient LLM architecture designs.
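
The selection rule described in the abstract (repeatedly remove the attention or FFN layer whose removal changes the model's output the least) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy residual model, the mean-squared-error distance, and all names (`make_toy_model`, `forward`, `output_distance`, `greedy_layer_pruning`) are assumptions chosen for illustration. A real LLM would replace the toy sublayers with actual attention and FFN modules and use a calibration set of text.

```python
import numpy as np

def make_toy_model(num_blocks, dim, seed=0):
    """Toy residual stack (assumption): each block has an 'attn' and an 'ffn' sublayer."""
    rng = np.random.default_rng(seed)
    layers = []
    for b in range(num_blocks):
        for kind in ("attn", "ffn"):
            weight = rng.normal(scale=0.1, size=(dim, dim))
            layers.append({"name": f"{kind}_{b}", "weight": weight})
    return layers

def forward(layers, x, pruned=frozenset()):
    """Run the stack; a pruned sublayer reduces to its residual (identity) path."""
    h = x
    for layer in layers:
        if layer["name"] in pruned:
            continue
        h = h + np.tanh(h @ layer["weight"])
    return h

def output_distance(a, b):
    """Mean squared error between outputs (an assumed stand-in metric)."""
    return float(np.mean((a - b) ** 2))

def greedy_layer_pruning(layers, calib_inputs, num_to_prune):
    """Iteratively pick the sublayer whose removal least alters the original output."""
    reference = forward(layers, calib_inputs)
    pruned = []
    for _ in range(num_to_prune):
        best_name, best_dist = None, float("inf")
        for layer in layers:
            name = layer["name"]
            if name in pruned:
                continue
            candidate = frozenset(pruned) | {name}
            dist = output_distance(forward(layers, calib_inputs, candidate), reference)
            if dist < best_dist:
                best_name, best_dist = name, dist
        pruned.append(best_name)
    return pruned

# Usage: 4 toy blocks (8 sublayers), prune 3 of them.
layers = make_toy_model(num_blocks=4, dim=8)
calib = np.random.default_rng(1).normal(size=(16, 8))
order = greedy_layer_pruning(layers, calib, num_to_prune=3)
print(order)
```

Because every candidate is always compared against the unpruned model's output, the sketch needs no task labels, which matches the task-agnostic framing above. Each step costs one forward pass per remaining candidate.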

Paper Structure (30 sections, 6 equations, 7 figures, 16 tables, 1 algorithm)

Introduction
Related works
Method
Preliminaries of LLMs
Formulation of structured pruning for LLMs
Iterative search algorithm as an efficient and approximate solver
Choices of metric functions
Experiments
Experiment setup
Main result
Analysis of pruned layers
Ablation study
Limitation
Conclusion and future works
Broader impact
...and 15 more sections

Figures (7)

Figure 1: (a): Overview of FinerCut. FinerCut iteratively examines candidate attention and FFN layers to find the next pruning target that minimizes output discrepancy compared to the original model. (b): Breakdown of pruned layers by type. More attention layers are removed than FFN layers. (c): Three major pruning behaviors we observed. Apart from pruning a whole transformer block or merging multiple transformer blocks into one through pruning, FinerCut tends to remove attention layers in consecutive transformer blocks.
Figure 2: Average zero/few-shot performance at different layer pruning ratios. We omit SliceGPT in (c) because it does not support MoE.
Figure 3: Perplexity (with a logarithmic scale) on WikiText2 at varying layer pruning ratios. Compared to ShortGPT and DeeperLayers, our method better preserves the language modeling capabilities.
Figure 4: Visualization of pruned layers at a 25% layer pruning ratio for Llama3-70B (top, with a 1.9% performance drop), Llama3-8B (middle, with an 11.6% performance drop), and Mixtral-8x7B (bottom, with a 17.0% performance drop) using FinerCut. The figure's markers distinguish pruned self-attention layers, pruned FFN layers, and remaining layers. Notably, consecutive self-attention layers are removed, resulting in a heterogeneous structure where multiple FFNs process the output of one attention layer. More discussion in the section "Analysis of pruned layers".
Figure 5: Ablation study on Llama3-70B. (a) and (b): Pruning transformer blocks vs. pruning attention and FFN layers separately. (c) and (d): Comparison of three distance metrics in FinerCut.
...and 2 more figures
