Entropy-Based Block Pruning for Efficient Large Language Models

Liangwei Yang, Yuhui Xu, Juntao Tan, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke

TL;DR

This paper tackles the redundancy of large Transformer-based language models by introducing EntroDrop, an entropy-increase-based block pruning method. It uncovers a two-stage entropy dynamics pattern across Transformer layers and uses this insight to rank and prune computation blocks via the entropy increase criterion, specifically ΔH^l = H(Z^l) − H(Z^{l−1}). EntroDrop relies on a calibration dataset and compares multiple entropy estimators (Bucket, KNN, Renyi), with Bucket and KNN delivering robust pruning performance and speedups, while Renyi underperforms. Across Llama3.1-8B and Mistral-7B-v0.3, EntroDrop consistently outperforms cosine-similarity-based baselines and achieves substantial inference speedups (notably after pruning up to 12 attention layers) with negligible accuracy loss, demonstrating practical impact for efficient LLM deployment.
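The entropy-increase criterion can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes a histogram ("bucket") estimator applied per hidden dimension and averaged, a bin range shared across layers, and toy hidden states in place of activations collected on a calibration dataset. The function names `bucket_entropy` and `rank_blocks_by_entropy_increase` are hypothetical, and the sketch ranks every block rather than restricting the ranking to the second stage.

```python
import numpy as np


def bucket_entropy(z, value_range, num_bins=64):
    """Histogram ("bucket") estimate of Shannon entropy, in nats.

    z has shape (num_tokens, hidden_dim). Each hidden dimension is binned
    separately and the per-dimension entropies are averaged. The averaging
    is an assumption of this sketch, not a detail taken from the paper.
    """
    z = np.asarray(z, dtype=np.float64)
    entropies = []
    for column in z.T:
        counts, _ = np.histogram(column, bins=num_bins, range=value_range)
        probs = counts[counts > 0] / counts.sum()
        entropies.append(-np.sum(probs * np.log(probs)))
    return float(np.mean(entropies))


def rank_blocks_by_entropy_increase(hidden_states, num_bins=64):
    """Rank blocks by Delta H^l = H(Z^l) - H(Z^{l-1}), lowest increase first.

    hidden_states[l] is Z^l; hidden_states[0] is the input to block 1.
    All layers share one bin range so their entropies are comparable.
    """
    lo = min(float(np.min(z)) for z in hidden_states)
    hi = max(float(np.max(z)) for z in hidden_states)
    h = [bucket_entropy(z, (lo, hi), num_bins) for z in hidden_states]
    deltas = [(l, h[l] - h[l - 1]) for l in range(1, len(h))]
    return sorted(deltas, key=lambda pair: pair[1])


# Toy stand-in for calibration activations (illustrative values only).
rng = np.random.default_rng(0)
z0 = rng.normal(size=(2000, 8))
hidden_states = [
    z0,        # Z^0: input to block 1
    z0,        # Z^1: block 1 leaves the representation unchanged
    z0 * 3.0,  # Z^2: block 2 spreads the representation out
    z0 * 4.5,  # Z^3: block 3 spreads it out a little further
]

ranking = rank_blocks_by_entropy_increase(hidden_states)
num_to_prune = 1
prune_candidates = [block for block, _ in ranking[:num_to_prune]]
print(ranking)
print(prune_candidates)
```

In this toy setup, block 1 adds no entropy and is therefore the first pruning candidate, while block 2, which widens the distribution the most, is ranked last. The shared bin range matters: if each layer were binned over its own minimum and maximum, a pure rescaling of the hidden states would leave the estimate unchanged.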

Abstract

As large language models continue to scale, their growing computational and storage demands pose significant challenges for real-world deployment. In this work, we investigate redundancy within Transformer-based models and propose an entropy-based pruning strategy to enhance efficiency while maintaining performance. Empirical analysis reveals that the entropy of hidden representations decreases in the early blocks but progressively increases across most subsequent blocks. This trend suggests that entropy serves as a more effective measure of information richness within computation blocks. Unlike cosine similarity, which primarily captures geometric relationships, entropy directly quantifies uncertainty and information content, making it a more reliable criterion for pruning. Extensive experiments demonstrate that our entropy-based pruning approach surpasses cosine similarity-based methods in reducing model size while preserving accuracy, offering a promising direction for efficient model deployment.

Paper Structure

This paper contains 17 sections, 7 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Entropy Dynamics among Layers during Inference
  • Figure 2: Overview of the EntroDrop framework. Stage 1 is kept intact, while Stage 2 exhibits increasing entropy. Blocks in Stage 2 are ranked by their entropy increase, and those with the lowest increase are pruned.
  • Figure 3: Heatmap of Calibration Datasets
  • Figure 4: Impact of Calibration Datasets.
  • Figure 5: Impact of Entropy Estimate Methods.
  • ...and 3 more figures