Reassessing Layer Pruning in LLMs: New Insights and Methods

Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, Zhaowei Zhu

TL;DR

The results demonstrate that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the remaining last three layers, yields remarkably strong performance.

Abstract

Although large language models (LLMs) have achieved remarkable success across various domains, their considerable scale necessitates substantial computational resources, posing significant challenges for deployment in resource-constrained environments. Layer pruning, as a simple yet effective compression method, removes layers of a model directly, reducing computational overhead. However, what are the best practices for layer pruning in LLMs? Are sophisticated layer selection metrics truly effective? Does the LoRA (Low-Rank Adaptation) family, widely regarded as a leading method for pruned model fine-tuning, truly meet expectations when applied to post-pruning fine-tuning? To answer these questions, we dedicate thousands of GPU hours to benchmarking layer pruning in LLMs and gaining insights across multiple dimensions. Our results demonstrate that a simple approach, i.e., pruning the final 25% of layers followed by fine-tuning the lm_head and the remaining last three layers, yields remarkably strong performance. Following this guide, we prune Llama-3.1-8B-It and obtain a model that outperforms many popular LLMs of similar size, such as ChatGLM2-6B, Vicuna-7B-v1.5, Qwen1.5-7B and Baichuan2-7B. We release the optimal model weights on Hugging Face, and the code is available on GitHub.
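As an illustration of the recipe described in the abstract, the following minimal sketch computes which decoder layers would be removed and which would be unfrozen for fine-tuning. It is not the authors' released code: the names `PruningPlan` and `plan_tail_pruning` are hypothetical, the rounding rule (round down) is an assumption, and the 32-layer depth used in the usage note is an assumption about the base model rather than something stated in this text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PruningPlan:
    """Hypothetical container describing a tail-pruning plan."""
    kept_layers: List[int]       # decoder layer indices that remain
    pruned_layers: List[int]     # decoder layer indices removed from the tail
    trainable_layers: List[int]  # kept layers unfrozen for fine-tuning
    train_lm_head: bool          # whether the output head is also unfrozen


def plan_tail_pruning(num_layers: int,
                      prune_ratio: float = 0.25,
                      num_trainable: int = 3) -> PruningPlan:
    """Prune the final `prune_ratio` of layers, then mark the last
    `num_trainable` remaining layers (plus lm_head) as trainable."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    if not 0.0 <= prune_ratio < 1.0:
        raise ValueError("prune_ratio must be in [0, 1)")
    if num_trainable < 0:
        raise ValueError("num_trainable must be non-negative")

    # Assumption: a fractional layer count is rounded down.
    num_pruned = int(num_layers * prune_ratio)
    num_kept = num_layers - num_pruned

    kept = list(range(num_kept))
    pruned = list(range(num_kept, num_layers))
    # Last few surviving layers; empty if nothing should be unfrozen.
    trainable = kept[-num_trainable:] if num_trainable > 0 else []

    return PruningPlan(
        kept_layers=kept,
        pruned_layers=pruned,
        trainable_layers=trainable,
        train_lm_head=True,
    )
```

Under the assumed 32-layer depth, `plan_tail_pruning(32)` removes layers 24 to 31, keeps layers 0 to 23, and marks layers 21, 22 and 23 together with `lm_head` as trainable. Every other parameter would stay frozen, which is the contrast with LoRA-style fine-tuning that the abstract draws.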

Paper Structure

This paper contains 15 sections, 4 equations, 4 figures, 14 tables.

Figures (4)

  • Figure 1: Insights for best practices (left) and the pruned models (right). Insights: 1) Prune from the tail. 2) Fine-tune the last few layers (instead of using LoRA). 3) Iterative pruning rarely helps. Pruned models: Llama-3.1-6.3B-It-Alpaca and Llama-3.1-6.3B-It-Dolly achieve a good trade-off between performance and model size, as they are positioned in the top left corner.
  • Figure 2: The effect of different pruning rates on LLM layer pruning.
  • Figure A: The effect of different pruning rates on LLM layer pruning using the random metric.
  • Figure B: Visualization of the layer similarity matrix of 16-layer Llama-3.1-8B-It models (using Taylor) obtained by different pruning strategies. Left: one-shot pruning; Middle: iterative pruning with pruning step = 1; Right: iterative pruning with pruning step = 8.