Streamlining Redundant Layers to Compress Large Language Models
Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen
TL;DR
This work introduces LLM-Streamline, a layer-pruning-and-replacement framework for large language models. It identifies and removes contiguous, low-impact layers based on the cosine similarity of hidden states, then compensates for the resulting loss with a lightweight replacement network. It also proposes a stability metric that addresses the limitations of accuracy as an evaluation measure for compressed models by emphasizing prediction confidence and consistency. In extensive experiments across diverse models and tasks, the method achieves state-of-the-art performance and training efficiency; FFN-based replacements often deliver the best results, and Layer Replacement outperforms LoRA in accuracy, stability, and resource usage. The approach shows strong potential for practical deployment, offering a principled, data-efficient route to compressing large transformers while preserving performance, and it provides a foundation for combining pruning with other compression techniques.
Abstract
This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, which makes it possible to identify less important layers for pruning. LLM-Streamline comprises two parts: layer pruning, which removes the consecutive layers with the lowest importance according to the target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers and mitigate the performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency. Our code is available at https://github.com/RUCKBReasoning/LLM-Streamline
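
As a rough illustration of the layer-pruning step, the sketch below scores every block of n consecutive layers by the cosine similarity between the hidden states entering and leaving the block, and selects the block with the highest similarity, i.e. the one that changes the hidden states least. The function names, the (tokens, dim) array layout, and the use of NumPy are assumptions made for illustration; this is not the released implementation, which should be consulted at the repository above.

```python
import numpy as np


def mean_cosine_similarity(a, b, eps=1e-8):
    """Average token-wise cosine similarity between two (tokens, dim) arrays."""
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return float(np.mean(dot / norms))


def select_layers_to_prune(hidden_states, n_prune):
    """Pick the block of n_prune consecutive layers with the least impact.

    Assumed layout (hypothetical): hidden_states[i] is the input to layer i,
    and hidden_states[-1] is the output of the last layer, so a model with
    L layers supplies L + 1 arrays.

    Returns (start, similarity): layers start .. start + n_prune - 1 form
    the block whose input and output hidden states are most similar.
    """
    num_layers = len(hidden_states) - 1
    if not 0 < n_prune <= num_layers:
        raise ValueError("n_prune must be between 1 and the number of layers")

    best_start, best_sim = 0, -np.inf
    for start in range(num_layers - n_prune + 1):
        # Compare what enters the block with what leaves it.
        sim = mean_cosine_similarity(
            hidden_states[start], hidden_states[start + n_prune]
        )
        if sim > best_sim:
            best_start, best_sim = start, sim
    return best_start, best_sim
```

In practice the hidden states would be collected by running the model on a small calibration set, and the selected block would then be swapped for a lightweight network trained to map the block's input hidden states to its output hidden states.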
