Beyond KV Caching: Shared Attention for Efficient LLMs

Bingli Liao, Danilo Vasconcellos Vargas

TL;DR

The paper tackles the efficiency gap in LLM inference by exploiting isotropic patterns in attention distributions to share pre-computed attention weights across multiple layers. It introduces Shared Attention (SA), which bypasses repeated softmax computations and reduces KV-cache requirements by reusing attention weights, supported by empirical isotropy analyses across models and pretraining dynamics. Across Llama2-7B and Llama3-8B, SA shows minimal accuracy loss when applied to later layers and can be further improved via fine-tuning on instruct data, suggesting practical deployment benefits in constrained environments. The work suggests future directions including pretraining SA and combining it with other memory- and computation-reduction techniques like CLA, potentially enabling broader, more efficient use of large transformers.

Abstract

The efficiency of large language models (LLMs) remains a critical challenge, particularly in contexts where computational resources are limited. Traditional attention mechanisms in these models, while powerful, require significant computational and memory resources due to the necessity of recalculating and storing attention weights across different layers. This paper introduces a novel Shared Attention (SA) mechanism, designed to enhance the efficiency of LLMs by directly sharing computed attention weights across multiple layers. Unlike previous methods that focus on sharing intermediate Key-Value (KV) caches, our approach utilizes the isotropic tendencies of attention distributions observed in advanced LLMs post-pretraining to reduce both the FLOPs and the size of the KV cache required during inference. We empirically demonstrate that implementing SA across various LLMs results in minimal accuracy loss on standard benchmarks. Our findings suggest that SA not only conserves computational resources but also maintains robust model performance, thereby facilitating the deployment of more efficient LLMs in resource-constrained environments.
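
The core idea in the abstract, reusing one layer's attention weights in the layers that follow it, can be sketched in a few lines. This is a minimal single-head illustration in plain Python, not the authors' implementation: the names `attention_weights`, `run_layers`, and `shared_span`, the zero-based inclusive layer span, and the bare residual connection (no normalization, MLP, or multi-head logic) are all assumptions made for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_weights(q, k):
    """Causal attention weights softmax(QK^T / sqrt(d)), one row per query."""
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        # Position i may only attend to positions 0..i.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k[: i + 1]]
        weights.append(softmax(scores) + [0.0] * (len(k) - i - 1))
    return weights


def run_layers(x, layers, shared_span):
    """Run a stack of toy attention layers.

    Layers inside shared_span = (first, last), zero-based and inclusive
    (an assumption of this sketch), reuse the weights computed at `first`
    and therefore skip their own Q/K projections and softmax.
    Returns the final hidden states and the number of softmax passes.
    """
    first, last = shared_span
    shared = None
    computed = 0
    for idx, layer in enumerate(layers):
        if shared is not None and first < idx <= last:
            attn = shared  # reuse: no Q, K, or softmax for this layer
        else:
            attn = attention_weights(matmul(x, layer["wq"]), matmul(x, layer["wk"]))
            computed += 1
            if idx == first:
                shared = attn
        values = matmul(x, layer["wv"])  # V is still computed in every layer
        out = matmul(attn, values)
        x = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, out)]  # residual
    return x, computed


identity = [[1.0, 0.0], [0.0, 1.0]]
toy_layers = [{"wq": identity, "wk": identity, "wv": identity} for _ in range(4)]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden, softmax_passes = run_layers(tokens, toy_layers, shared_span=(1, 3))
```

In this toy setup the layers after the first one in the span never form queries or keys, which is where the savings in softmax computation and key storage described above would come from; how much carries over to a real multi-head model depends on details not covered here.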

Paper Structure

This paper contains 15 sections, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Illustration of various sharing algorithms. The MQA and GQA methods share the Key and Value caches across Query heads within the same layer to reduce memory usage. The CLA method extends this by sharing the Key and Value caches across different layers. Our method, Shared Attention, advances this concept further by sharing the attention weights across multiple layers.
  • Figure 2: Layer-wise similarity of attention weights across various LLMs. The x-axis and y-axis represent the layer indices, while the z-axis depicts the cosine similarity values. The distinct similarity patterns are indicative of the specific functional roles each group of layers plays within the overall architecture.
  • Figure 3: Evolution of layer-wise attention weight similarity throughout the pretraining of the Baichuan2 7B model, as the number of trained tokens grows from 220 billion to 2.6 trillion. The color gradient represents cosine similarity, illustrating the transition in attention patterns from the early to the late stages of pretraining.
  • Figure 4: The figure illustrates the implementation of Shared Attention within specific layer segments of the model. Shared Attention spans from layer 27 to 30 for a four-layer segment and from layer 23 to 30 for an eight-layer segment.
  • Figure 5: The figure displays the weighted cumulative variance for the Llama2-7B-chat and Llama3-8B-instruct models. The two lower axes represent the model's structure: the left axis details the 32 layers, and the right axis shows the 32 heads within each layer. The z-axis represents the variance values.
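
The layer-wise similarity shown in Figures 2 and 3 is a cosine similarity between the attention weights of pairs of layers. A minimal sketch of that computation follows, assuming each layer's attention map is given as a list of rows and is simply flattened before comparison; the function names are hypothetical, and any averaging over heads or input samples the authors may have used is not reproduced here.

```python
import math


def cosine_similarity(attn_a, attn_b):
    """Cosine similarity between two attention maps of the same shape."""
    flat_a = [x for row in attn_a for x in row]
    flat_b = [x for row in attn_b for x in row]
    dot = sum(a * b for a, b in zip(flat_a, flat_b))
    norm_a = math.sqrt(sum(a * a for a in flat_a))
    norm_b = math.sqrt(sum(b * b for b in flat_b))
    return dot / (norm_a * norm_b)


def layer_similarity_matrix(attn_per_layer):
    """Pairwise similarity: entry [i][j] compares layer i with layer j."""
    n = len(attn_per_layer)
    return [
        [cosine_similarity(attn_per_layer[i], attn_per_layer[j]) for j in range(n)]
        for i in range(n)
    ]


# Toy causal attention maps for three layers over two tokens.
toy_maps = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.1, 0.9]],
]
similarity = layer_similarity_matrix(toy_maps)
```

Blocks of high values away from the diagonal of such a matrix would indicate groups of layers with near-identical attention weights, which are the natural candidates for sharing.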