SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks

Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim

TL;DR

SLEB introduces a training-free, block-level pruning method for LLMs by identifying and eliminating redundant transformer blocks. It relies on calibration-data–driven redundancy verification using metrics that account for evolving model behavior, ensuring end-to-end speedups that correlate with the number of removed blocks. Empirical results show SLEB preserves perplexity and zero-shot accuracy while delivering notable latency and throughput improvements across OPT and LLaMA-2 models, and remains compatible with 4-bit post-training quantization. The approach addresses key limitations of prior pruning and early-exit methods, delivering robust, hardware-friendly speedups without extensive retraining.

Abstract

Large language models (LLMs) have proven to be highly effective across various natural language processing tasks. However, their large number of parameters poses significant challenges for practical deployment. Pruning, a technique aimed at reducing the size and complexity of LLMs, offers a potential solution by removing redundant components from the network. Despite the promise of pruning, existing methods often struggle to achieve substantial end-to-end LLM inference speedup. In this paper, we introduce SLEB, a novel approach designed to streamline LLMs by eliminating redundant transformer blocks. We choose the transformer block as the fundamental unit for pruning, because LLMs exhibit block-level redundancy with high similarity between the outputs of neighboring blocks. This choice allows us to effectively enhance the processing speed of LLMs. Our experimental results demonstrate that SLEB outperforms previous LLM pruning methods in accelerating LLM inference while also maintaining superior perplexity and accuracy, making SLEB a promising technique for enhancing the efficiency of LLMs. The code is available at: https://github.com/jiwonsong-dev/SLEB.
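
To make the idea of calibration-driven block elimination concrete, here is a minimal sketch of a greedy removal loop that re-scores every remaining block after each removal, so the redundancy estimate follows the already-pruned model. It is an illustration under stated assumptions, not the authors' implementation: the toy scalar "blocks", the `calibration_loss` function (mean squared error against the full model's outputs), and the name `greedy_block_removal` are all hypothetical stand-ins for real transformer blocks and a real calibration metric.

```python
from typing import Callable, List, Sequence, Tuple

Block = Callable[[float], float]  # toy stand-in for a transformer block


def run(blocks: Sequence[Block], x: float) -> float:
    """Apply the blocks in order, like a forward pass through a block stack."""
    for block in blocks:
        x = block(x)
    return x


def calibration_loss(blocks: Sequence[Block],
                     data: Sequence[Tuple[float, float]]) -> float:
    """Hypothetical metric: mean squared error on (input, target) pairs."""
    return sum((run(blocks, x) - y) ** 2 for x, y in data) / len(data)


def greedy_block_removal(blocks: Sequence[Block],
                         data: Sequence[Tuple[float, float]],
                         num_remove: int) -> Tuple[List[int], List[int]]:
    """Remove num_remove blocks one at a time.

    Each round re-evaluates every remaining block on the current (already
    pruned) model and drops the one whose removal gives the lowest loss.
    """
    kept = list(range(len(blocks)))
    removed: List[int] = []
    for _ in range(num_remove):
        best_idx, best_loss = kept[0], float("inf")
        for idx in kept:
            candidate = [blocks[i] for i in kept if i != idx]
            loss = calibration_loss(candidate, data)
            if loss < best_loss:
                best_idx, best_loss = idx, loss
        kept.remove(best_idx)
        removed.append(best_idx)
    return kept, removed


# Toy model: blocks 1 and 3 barely change their input, so they are redundant.
blocks: List[Block] = [
    lambda x: x + 1.0,
    lambda x: x + 0.01,
    lambda x: x * 2.0,
    lambda x: x + 0.03,
]
# Calibration targets are the outputs of the full, unpruned toy model.
data = [(x, run(blocks, x)) for x in (0.0, 1.0, 2.0)]

kept, removed = greedy_block_removal(blocks, data, num_remove=2)
print(kept, removed)  # [0, 2] [1, 3]
```

Because a whole block is dropped from the stack, the pruned toy model simply runs fewer steps, which mirrors why block-level removal can translate directly into end-to-end speedup.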

Paper Structure

This paper contains 29 sections, 5 equations, 12 figures, 13 tables, 1 algorithm.

Figures (12)

  • Figure 1: Typical LLM architecture
  • Figure 2: Overview of previous pruning methods (2:4 pruning and channel-wise, a.k.a. row/column-wise, pruning) and the proposed SLEB
  • Figure 3: Speedup achieved by 2:4 pruning on the multiplication of a $b \times h$ matrix by an $h \times h$ matrix ($b$: batch size, $h$: hidden size), measured on an NVIDIA RTX A6000 GPU. The dashed grey line marks the peak speedup attainable with 2:4 pruning. For batch sizes between 1 and 16, the 2:4 case is measured with a dense matrix multiplication kernel, because NVIDIA CUTLASS effectively supports 2:4 sparse matrix multiplication only for batch sizes larger than 16.
  • Figure 4: Perplexity comparison on WikiText-2 for OPT-6.7B (left) and OPT-13B (right) after removing consecutive transformer blocks. "Early Exit" refers to removing the very last blocks from the target model, while "Chunk Best" represents the best perplexity results achieved by testing all possible removable points of consecutive blocks.
  • Figure 5: Percentage of token prediction alignment with the final predictions of LLMs for each transformer block.
  • ...and 7 more figures
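
For readers unfamiliar with the 2:4 pattern referenced in Figures 2 and 3, the sketch below shows what the constraint means: in every group of four consecutive weights, at most two are nonzero. Selecting the two largest-magnitude weights per group is an assumption made here for illustration; the captions above do not say which selection criterion the compared methods use, and the function name `prune_2_of_4` is hypothetical.

```python
from typing import List


def prune_2_of_4(row: List[float]) -> List[float]:
    """Zero out two of every four consecutive weights (2:4 sparsity).

    Assumption for this sketch: the two largest-magnitude weights in each
    group of four are kept.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned: List[float] = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.1]
sparse_row = prune_2_of_4(row)
print(sparse_row)  # [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

The pruned row still has its original length, so any speedup depends on a kernel that exploits the pattern, which is the limitation the Figure 3 caption points to for small batch sizes.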