FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models

Yang Zhang, Yawei Li, Xinpeng Wang, Qianli Shen, Barbara Plank, Bernd Bischl, Mina Rezaei, Kenji Kawaguchi

TL;DR

FinerCut is a new form of fine-grained layer pruning that, in contrast to prior work at the transformer block level, treats every self-attention and feed-forward network layer within a block as an individual pruning candidate.

Abstract

Overparametrized transformer networks are the state-of-the-art architecture for Large Language Models (LLMs). However, such models contain billions of parameters, making large compute a necessity while raising environmental concerns. To address these issues, we propose FinerCut, a new form of fine-grained layer pruning, which in contrast to prior work at the transformer block level, considers all self-attention and feed-forward network (FFN) layers within blocks as individual pruning candidates. FinerCut prunes layers whose removal causes minimal alteration to the model's output -- contributing to a new, lean, interpretable, and task-agnostic pruning method. Tested across 9 benchmarks, our approach retains 90% of the performance of Llama3-8B with 25% of layers removed, and 95% of the performance of Llama3-70B with 30% of layers removed, all without fine-tuning or post-pruning reconstruction. Strikingly, 42% (34 out of 80) of the self-attention layers in Llama3-70B can be removed while preserving 99% of its performance -- without additional fine-tuning after removal. Moreover, FinerCut provides a tool to inspect the types and locations of pruned layers, making it possible to observe interesting pruning behaviors. For instance, we observe a preference for pruning self-attention layers, often at deeper consecutive decoder layers. We hope our insights inspire future efficient LLM architecture designs.
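
The selection rule described in the abstract (repeatedly remove the layer whose removal changes the model's output the least) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy residual stack, the helper names (`forward`, `discrepancy`, `greedy_prune`, `scale`), and the use of mean Euclidean distance as the output discrepancy are all assumptions made for illustration.

```python
import math

def forward(layers, x, pruned=frozenset()):
    """Run a toy residual stack; a pruned sublayer is skipped (acts as identity)."""
    h = list(x)
    for name, fn in layers:
        if name in pruned:
            continue
        delta = fn(h)
        h = [a + b for a, b in zip(h, delta)]  # residual connection
    return h

def discrepancy(layers, inputs, reference, pruned):
    """Mean Euclidean distance to the unpruned outputs (assumed metric)."""
    total = 0.0
    for x, ref in zip(inputs, reference):
        total += math.dist(forward(layers, x, pruned), ref)
    return total / len(inputs)

def greedy_prune(layers, inputs, num_to_prune):
    """Greedily pick sublayers whose removal least changes the output."""
    reference = [forward(layers, x) for x in inputs]
    pruned, order = set(), []
    for _ in range(num_to_prune):
        candidates = [name for name, _ in layers if name not in pruned]
        best = min(
            candidates,
            key=lambda n: discrepancy(layers, inputs, reference, pruned | {n}),
        )
        pruned.add(best)
        order.append(best)
    return order

def scale(c):
    """Hypothetical stand-in for an attention or FFN sublayer: returns c * h."""
    return lambda h: [c * v for v in h]

# Hypothetical four-sublayer model and calibration inputs.
toy_layers = [
    ("block0.attn", scale(0.5)),
    ("block0.ffn", scale(0.8)),
    ("block1.attn", scale(0.01)),
    ("block1.ffn", scale(0.3)),
]
calibration = [[1.0, 2.0], [0.5, -1.0]]
order = greedy_prune(toy_layers, calibration, 2)
print(order)  # ['block1.attn', 'block1.ffn']
```

In this toy setting the sublayer with the smallest contribution is removed first. Each step re-evaluates every remaining candidate against the original model's outputs, so the cost grows with the number of candidates times the number of pruning steps.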

Paper Structure

This paper contains 30 sections, 6 equations, 7 figures, 16 tables, and 1 algorithm.

Figures (7)

  • Figure 1: (a): Overview of FinerCut. FinerCut iteratively examines candidate attention and FFN layers to find the next pruning target that minimizes output discrepancy compared to the original model. (b): Overview of pruned layers by type. More attention layers are removed than FFN layers. (c): Three major pruning behaviors we observed. Apart from pruning a whole transformer block or merging multiple transformer blocks into one through pruning, FinerCut tends to remove attention layers in consecutive transformer blocks.
  • Figure 2: Average zero/few-shot performance at different layer pruning ratios. We omit SliceGPT in (c) because it does not support MoE.
  • Figure 3: Perplexity (with a logarithmic scale) on WikiText2 at varying layer pruning ratios. Compared to ShortGPT and DeeperLayers, our method better preserves the language modeling capabilities.
  • Figure 4: Visualization of pruned layers at 25% layer pruning ratio for Llama3-70B (top, with 1.9% performance drop), Llama3-8B (middle, with 11.6% performance drop), and Mixtral-8x7B (bottom, with 17.0% performance drop) using FinerCut. Separate markers (not preserved in this listing) indicate pruned self-attention layers, pruned FFN layers, and remaining layers. Notably, consecutive self-attention layers are removed, resulting in a heterogeneous structure where multiple FFNs process the output of one attention layer. More discussion in the section on analyzing pruned layers.
  • Figure 5: Ablation study on Llama3-70B. (a) and (b): Pruning transformer blocks vs. pruning attention and FFN layers separately. (c) and (d): Comparison of three distance metrics in FinerCut.
  • ...and 2 more figures
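
The pruning behaviors named in the captions of Figures 1 and 4 (counts by layer type, fully pruned blocks, and runs of consecutive blocks with pruned attention) can be read off a list of pruned layers with a few lines of code. The sketch below is illustrative only: the function name `summarize_pruning`, the `(block_index, kind)` encoding, and the example data are assumptions, not the paper's tooling or results.

```python
from collections import Counter

def summarize_pruning(pruned, num_blocks):
    """Summarize pruned sublayers.

    pruned: iterable of (block_index, kind) pairs, kind in {"attn", "ffn"}.
    Returns counts by kind, fully pruned blocks, and maximal runs (length >= 2)
    of consecutive blocks whose attention sublayer is pruned.
    """
    pruned = set(pruned)
    counts = Counter(kind for _, kind in pruned)
    full_blocks = [
        b for b in range(num_blocks)
        if (b, "attn") in pruned and (b, "ffn") in pruned
    ]
    runs, start = [], None
    for b in range(num_blocks + 1):  # one extra step closes a run at the end
        if b < num_blocks and (b, "attn") in pruned:
            if start is None:
                start = b
        elif start is not None:
            if b - start >= 2:
                runs.append((start, b - 1))
            start = None
    return {"counts": dict(counts), "full_blocks": full_blocks, "attn_runs": runs}

# Made-up pruning result for an 8-block model, for illustration only.
example = [(3, "attn"), (4, "attn"), (5, "attn"), (5, "ffn"), (7, "ffn")]
summary = summarize_pruning(example, num_blocks=8)
print(summary)
```

For the made-up input above, attention is pruned in blocks 3 through 5, so the FFNs of blocks 3 and 4 both operate on a residual stream whose most recent attention output comes from block 2, which is the heterogeneous structure the Figure 4 caption describes.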