What Matters in Transformers? Not All Attention is Needed

Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

TL;DR

Transformer-based LLMs exhibit structured redundancy, especially in attention layers, leading to unnecessary compute and memory costs. The authors introduce a similarity-based metric to quantify module importance and demonstrate that attention layers are markedly redundant, enabling substantial pruning with minimal accuracy loss via Attention Drop and Joint Layer Drop. These methods achieve significant speedups and KV-cache memory reductions across models (e.g., up to 48.4% speedup on Llama-2-70B with 2.4% performance loss), and larger models remain robust to pruning. The work provides practical design guidance for efficient transformers and offers publicly available code to replicate and extend these results.
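The KV-cache saving can be made concrete with a little arithmetic: a dropped attention layer stores no keys or values, so cache size scales linearly with the number of attention layers retained. The sketch below is a back-of-the-envelope estimate, not the authors' measurement code. The function name `kv_cache_bytes` and the configuration values in the example (80 attention layers, 8 key/value heads, head dimension 128, 16-bit values, 4096-token context) are illustrative assumptions, not figures taken from the text above.

```python
def kv_cache_bytes(num_attn_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Estimate KV-cache size in bytes (hypothetical helper).

    The leading factor of 2 counts one key tensor and one value tensor
    per retained attention layer.
    """
    return (2 * num_attn_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Illustrative configuration (assumed values, not reported in the text).
full = kv_cache_bytes(num_attn_layers=80, num_kv_heads=8, head_dim=128, seq_len=4096)
half = kv_cache_bytes(num_attn_layers=40, num_kv_heads=8, head_dim=128, seq_len=4096)

print(f"all attention layers kept: {full / 2**30:.3f} GiB")
print(f"half of them dropped:      {half / 2**30:.3f} GiB")
```

Under these assumptions, dropping half of the attention layers halves the cache. The end-to-end speedup depends on more than cache size, so this estimate says nothing about the reported 48.4% figure.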

Abstract

While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant architectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, how redundancy varies across the different modules of a transformer, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Transformers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing transformers from other architectures, we found that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading performance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop by pruning half of the attention layers. Furthermore, by tracing model checkpoints throughout the training process, we observed that attention layer redundancy is inherent and consistent across training stages. We further propose a method that jointly drops Attention and MLP layers, allowing layers to be dropped more aggressively. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of the performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code is released at: https://github.com/Shwai-He/LLM-Drop.
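One way to read the similarity-based metric: if a module's output is nearly identical to its input, the module changes the hidden state very little and is a candidate for dropping. The sketch below scores a module as one minus the mean per-token cosine similarity between its input and output, then selects the lowest-scoring modules. It is a minimal illustration under stated assumptions, not the released LLM-Drop implementation. The function names are hypothetical, the hidden states are toy vectors, and the outputs are assumed to already include the residual connection.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def importance_score(inputs, outputs):
    """1 - mean per-token cosine similarity (hypothetical helper).

    `inputs` and `outputs` are lists of per-token hidden-state vectors
    entering and leaving one module. High similarity gives low importance.
    """
    sims = [cosine_similarity(x, y) for x, y in zip(inputs, outputs)]
    return 1.0 - sum(sims) / len(sims)


def select_modules_to_drop(scores, n):
    """Return the indices of the n modules with the lowest importance."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:n])


# Toy hidden states for two tokens (assumed data, for illustration only).
hidden_in = [[1.0, 0.0], [0.0, 1.0]]
unchanged_out = [[1.0, 0.0], [0.0, 1.0]]   # module leaves the states as they are
rotated_out = [[0.0, 1.0], [1.0, 0.0]]     # module changes the states completely

print(importance_score(hidden_in, unchanged_out))  # redundant module
print(importance_score(hidden_in, rotated_out))    # important module
print(select_modules_to_drop([0.9, 0.01, 0.5, 0.02], n=2))
```

In practice the hidden states would come from running the model on a calibration dataset, and the scores would be computed separately for blocks, MLP layers, and attention layers.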

Paper Structure (43 sections, 7 equations, 11 figures, 11 tables)

Figures (11)

  • Figure 1: Visualization of Block Drop. Blocks with high similarity scores (i.e., low importance scores) are marked for dropping; the dropped blocks are blurred.
  • Figure 3: Importance scores for MLP and Attention layers, where we use various calibration datasets for a comprehensive analysis.
  • Figure 4: Visualization of Layer Drop, where we visualize dropping either MLP or Attention Layers. Given the residual connection, we take LayerNorm together with the corresponding layers. The dropped layers with high similarity scores (i.e., low importance scores) are blurred.
  • Figure 5: Performance with respect to Dropping Ratios. The solid lines show the effect of dropping the $n$ modules with the lowest importance scores in Mistral-7B and Llama-2-13B; the dotted lines show the performance of the baseline and of random guessing.
  • Figure 6: Visualization of Dropping Order for Block Drop and Layer Drop. We visualize the remaining layers and blocks under different dropped numbers, where yellow areas represent the retained layers/blocks and red areas indicate the dropped portions.
  • ...and 6 more figures