Streamlining Redundant Layers to Compress Large Language Models

Xiaodong Chen, Yuxuan Hu, Jing Zhang, Yanling Wang, Cuiping Li, Hong Chen

TL;DR

This work introduces LLM-Streamline, a layer-pruning-and-replacement framework for large language models that identifies and removes contiguous, low-impact layers based on the cosine similarity of hidden states, then compensates for the resulting performance loss with a lightweight replacement network. It also proposes a stability metric to address the limitations of accuracy as a measure of compressed models, emphasizing prediction confidence and consistency. In experiments across diverse models and tasks, the method achieves state-of-the-art performance and training efficiency, with FFN-based replacements often delivering the best results and Layer Replacement outperforming LoRA in accuracy, stability, and resource usage. The approach shows strong potential for practical deployment, offering a principled, data-efficient route to compressing large transformers while preserving performance, and it provides a foundation for combining pruning with other compression techniques.

Abstract

This paper introduces LLM-Streamline, a pioneering work on layer pruning for large language models (LLMs). It is based on the observation that different layers have varying impacts on hidden states, enabling the identification of less important layers to be pruned. LLM-Streamline comprises two parts: layer pruning, which removes consecutive layers with the lowest importance based on target sparsity, and layer replacement, a novel module that trains a lightweight network to replace the pruned layers to mitigate performance loss. Additionally, a new metric called stability is proposed to address the limitations of the widely used accuracy metric in evaluating model compression. Experiments show that LLM-Streamline outperforms both previous and concurrent state-of-the-art pruning methods in terms of both performance and training efficiency. Our code is available at https://github.com/RUCKBReasoning/LLM-Streamline
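
The layer-pruning step can be illustrated with a minimal sketch. It is not taken from the released code: the function names, the `hidden_states` layout, and the rule for scoring a block (comparing the hidden state entering the block with the one leaving it) are assumptions made for illustration. It only relies on the idea stated above, that layers which barely change the hidden state, as measured by cosine similarity, are candidates for removal.

```python
import numpy as np


def mean_cosine_similarity(a, b, eps=1e-8):
    """Mean token-wise cosine similarity between two (tokens, dim) arrays."""
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float(np.mean(dot / (norms + eps)))


def pick_block_to_prune(hidden_states, n_prune):
    """Return (start_layer, score) for the most redundant contiguous block.

    hidden_states[i] is the input to layer i and hidden_states[i + 1] is its
    output, so a model with L layers supplies L + 1 arrays (hypothetical layout).
    """
    num_layers = len(hidden_states) - 1
    if not 0 < n_prune <= num_layers:
        raise ValueError("n_prune must be between 1 and the number of layers")
    best_start, best_score = 0, -np.inf
    for start in range(num_layers - n_prune + 1):
        # Assumption: a block whose output stays close to its input
        # (high cosine similarity) has low importance.
        score = mean_cosine_similarity(
            hidden_states[start], hidden_states[start + n_prune]
        )
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score
```

In practice the hidden states would be collected from a calibration set, and `n_prune` would follow from the target sparsity; the selected block is then the one handed to the replacement network.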

Paper Structure (37 sections, 6 equations, 4 figures, 34 tables)

Figures (4)

  • Figure 1: The left side of the figure illustrates the LLM-Streamline workflow, which includes layer pruning to remove consecutive layers and layer replacement where a lightweight network is trained to replace the pruned layers. The right side of the figure presents the comparison results of LLM-Streamline with the state-of-the-art (SOTA) pruning methods on 12 classification benchmarks (details in Section \ref{sec: benchmark}) after pruning about 25% of the parameters on Llama2-7B. LLM-Streamline achieves 11.2% higher relative accuracy than these methods, where the relative accuracy represents the percentage of the original model’s accuracy retained by the pruning method.
  • Figure 2: The cosine similarity between the input and output hidden states of each layer in OPT-1.3B, OPT-2.7B, OPT-6.7B, and Llama2-7B.
  • Figure 3: Validation loss curves during training of (a) FFN and SwiGLU; (b) Transformer layer.
  • Figure 4: (a) Stability of the pruned Llama2-7B at different pruning ratios. (b) Accuracy of the pruned Llama2-7B at different pruning ratios, compared to the original Llama2-7B, OpenLlama-3B-v2, and TinyLlama-1.1B. Metrics are averaged across classification benchmarks.
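
The relative accuracy mentioned in the Figure 1 caption can be written out as a small helper. This is a sketch under the caption's definition only; the function name is hypothetical, and whether the paper averages per-benchmark ratios or takes the ratio of averaged accuracies is not stated here.

```python
def relative_accuracy(pruned_accuracy, original_accuracy):
    """Percentage of the original model's accuracy retained after pruning."""
    if original_accuracy <= 0:
        raise ValueError("original_accuracy must be positive")
    return 100.0 * pruned_accuracy / original_accuracy
```

For example, with made-up numbers, a pruned model scoring 0.45 against an original score of 0.60 retains 75% of the original accuracy.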