Attention Is All You Need But You Don't Need All Of It For Inference of Large Language Models
Georgy Tyukin, Gbetondji J-S Dovonon, Jean Kaddour, Pasquale Minervini
TL;DR
The paper tackles the high inference cost of large language models, driven in part by the quadratic cost of attention in input length, by systematically analyzing layer skipping at inference time. It evaluates three skipping variants (MLP sublayers, attention sublayers, and full blocks) on Llama-v2 models (7B and 13B) across the OpenLLM benchmarks, with and without preserving the last layer. The key finding is that deeper attention layers are relatively redundant compared to MLPs: skipping them yields meaningful speedups with minimal accuracy loss (e.g., a ~1.8% drop in average performance on 13B when 33% of attention layers are removed), while dropping entire blocks yields substantial compute savings. The results frame layer redundancy as a general property of deep Transformers and offer practical guidance for deployment: selective attention-layer skipping gives a favorable trade-off between latency and accuracy, enabling lower-latency LLM serving without broad retraining or architectural changes.
Abstract
The inference demand for LLMs has skyrocketed in recent months, and serving models with low latency remains challenging because the cost of the attention layers grows quadratically with input length. In this work, we investigate the effect of dropping MLP and attention layers at inference time on the performance of Llama-v2 models. We find that dropping deeper attention layers only marginally decreases performance but, alongside dropping entire layers, leads to the best speedups. For example, removing 33% of attention layers in a 13B Llama-v2 model results in a 1.8% drop in average performance over the OpenLLM benchmark. We also observe that skipping layers while keeping the latter layers reduces performance as more layers are skipped, except when only the attention layers are skipped.
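
The skipping variants described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `layers_to_skip` and `forward`, the choice to skip the deepest layers first, and the floor-based rounding of the skipped fraction are all assumptions made for illustration. Each block is modeled as a pair of residual sublayers (attention, then MLP) acting on a plain list of floats, so skipping a sublayer just leaves the residual stream unchanged at that point.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple

Vector = List[float]
Sublayer = Callable[[Vector], Vector]
Block = Tuple[Sublayer, Sublayer]  # (attention, mlp); hypothetical stand-ins


def layers_to_skip(n_layers: int, fraction: float,
                   keep_last: bool = False) -> FrozenSet[int]:
    """Pick the deepest floor(fraction * n_layers) layer indices to skip.

    Assumption: skipping starts from the end of the stack. With
    keep_last=True the final layer is preserved and the skipped window
    shifts one layer earlier.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    end = n_layers - 1 if keep_last else n_layers
    n_skip = min(int(fraction * n_layers), max(end, 0))
    return frozenset(range(end - n_skip, end))


def forward(x: Vector, blocks: Sequence[Block],
            skip_attn: FrozenSet[int] = frozenset(),
            skip_mlp: FrozenSet[int] = frozenset()) -> Vector:
    """Run residual blocks, bypassing the selected sublayers.

    A block whose index is in both sets is skipped entirely.
    """
    for i, (attn, mlp) in enumerate(blocks):
        if i not in skip_attn:
            # Residual connection around the attention sublayer.
            x = [a + b for a, b in zip(x, attn(x))]
        if i not in skip_mlp:
            # Residual connection around the MLP sublayer.
            x = [a + b for a, b in zip(x, mlp(x))]
    return x
```

For a 40-layer stack (the depth of Llama-v2 13B), `layers_to_skip(40, 0.33)` selects the 13 deepest layers, indices 27 through 39; with `keep_last=True` it selects indices 26 through 38 instead. Passing the result as `skip_attn` alone corresponds to the attention-skipping variant, as `skip_mlp` alone to the MLP variant, and as both to dropping full blocks.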
