Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models

Georgy Tyukin, Gbetondji J-S Dovonon, Jean Kaddour, Pasquale Minervini

TL;DR

The paper tackles the high inference cost of large language models, which stems from the quadratic cost of attention in input length, by systematically analyzing layer skipping as a pruning strategy. It evaluates three skipping variants (MLP, attention, and full blocks) on Llama-v2 models (7B and 13B) across the OpenLLM benchmarks, with and without preserving the last layer. The key finding is that deeper attention layers are relatively redundant compared to MLPs, enabling meaningful speedups with minimal accuracy loss (e.g., a ~1.8% drop in average performance on the 13B model when 33% of attention layers are removed) and substantial compute savings when dropping entire blocks. The results offer practical guidance for deploying faster inference, illustrating that selective attention-layer pruning provides favorable trade-offs between latency and accuracy, and situating layer redundancy as a general property of deep Transformers. The practical impact is lower-latency LLM serving without broad retraining or architectural changes, using targeted, compute-aware pruning at inference time.

Abstract

The inference demand for LLMs has skyrocketed in recent months, and serving models with low latencies remains challenging due to the quadratic input length complexity of the attention layers. In this work, we investigate the effect of dropping MLP and attention layers at inference time on the performance of Llama-v2 models. We find that dropping deeper attention layers only marginally decreases performance but leads to the best speedups alongside dropping entire layers. For example, removing 33% of attention layers in a 13B Llama2 model results in a 1.8% drop in average performance over the OpenLLM benchmark. We also observe that skipping layers other than the final ones reduces performance as more layers are skipped, except when only the attention layers are skipped.
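
The three skipping variants described above (attention only, MLP only, or the entire block) can be illustrated with a toy residual stack. This is a minimal sketch under stated assumptions, not the authors' implementation: `Block`, `layers_to_skip`, and `run` are hypothetical names, normalization is omitted, and the sublayers are placeholder callables rather than real attention or feed-forward modules. It assumes that a skipped sublayer reduces to the identity through the residual path and that the deepest layers are skipped first, optionally sparing the final one.

```python
from typing import Callable, Iterable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(x: Vector, y: Vector) -> Vector:
    """Element-wise residual addition."""
    return [a + b for a, b in zip(x, y)]


class Block:
    """Toy Transformer block: x + attn(x), then x + mlp(x) (hypothetical, no norm)."""

    def __init__(self, attn: Sublayer, mlp: Sublayer) -> None:
        self.attn = attn
        self.mlp = mlp

    def forward(self, x: Vector, skip_attn: bool = False,
                skip_mlp: bool = False) -> Vector:
        # A skipped sublayer contributes nothing; only the residual path remains.
        if not skip_attn:
            x = add(x, self.attn(x))
        if not skip_mlp:
            x = add(x, self.mlp(x))
        return x


def layers_to_skip(num_layers: int, num_skipped: int, keep_last: bool) -> List[int]:
    """Indices of the deepest `num_skipped` layers, optionally sparing the final one."""
    end = num_layers - 1 if keep_last else num_layers
    start = max(0, end - num_skipped)
    return list(range(start, end))


def run(blocks: List[Block], x: Vector, skipped: Iterable[int], mode: str) -> Vector:
    """Run the stack; `mode` is 'attn', 'mlp', or 'block' (assumed labels)."""
    skipped_set = set(skipped)
    for i, block in enumerate(blocks):
        skip = i in skipped_set
        x = block.forward(
            x,
            skip_attn=skip and mode in ("attn", "block"),
            skip_mlp=skip and mode in ("mlp", "block"),
        )
    return x


# Placeholder sublayers: "attention" adds 1, "MLP" adds 10 to every feature.
blocks = [Block(lambda v: [1.0] * len(v), lambda v: [10.0] * len(v)) for _ in range(4)]
skipped = layers_to_skip(num_layers=4, num_skipped=2, keep_last=True)
full_output = run(blocks, [0.0], [], "attn")
attn_skipped_output = run(blocks, [0.0], skipped, "attn")
block_skipped_output = run(blocks, [0.0], skipped, "block")
```

With `keep_last=True` the final block always runs in full, which corresponds to the "preserving the last layer" setting; passing `keep_last=False` lets the skipped range extend to the final block.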

Paper Structure (14 sections, 2 figures, 10 tables)

Figures (2)

  • Figure 1: Cosine similarity of Llama-v2 layers with the previous layer: We observe that the deeper the layer, the more its features are similar to the previous layer except for the very last layer.
  • Figure 2: Skip mechanisms for skipping single layers and entire Transformer blocks (ffwd and attention layers) during inference.
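
The measurement behind Figure 1 can be sketched as the cosine similarity between each layer's features and those of the layer before it. The snippet below is an illustrative sketch only: `cosine_similarity` and `consecutive_similarities` are hypothetical helper names, and it assumes one feature vector per layer, whereas how the paper aggregates features over tokens and inputs is not stated in this summary.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consecutive_similarities(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of each layer's features with the previous layer's features."""
    return [
        cosine_similarity(prev, cur)
        for prev, cur in zip(hidden_states, hidden_states[1:])
    ]


# Made-up feature vectors for three consecutive layers (not data from the paper).
toy_hidden_states = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.1]]
similarities = consecutive_similarities(toy_hidden_states)
```

A value near 1 for a given layer means it changes the features very little relative to the previous layer, which is the sense in which the caption describes deeper layers as similar to their predecessors.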