Table of Contents
Fetching ...

Anchor Attention, Small Cache: Code Generation with Large Language Models

Xiangyu Zhang, Yu Zhou, Guang Yang, Harald C. Gall, Taolue Chen

TL;DR

A novel approach is proposed, AnchorCoder, which features token-wise anchor attention designed to extract and compress the contextual information, and layer-wise anchor attention enabling cross-layer communication to mitigate the issue of excessive superposition caused by the compression.

Abstract

The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand of computation resources has hindered a broader deployment and raised environmental concerns. A common strategy for diminishing computational demands is to cache Key-Value (KV) states from the attention mechanism which is adopted predominately by mainstream LLMs. It can mitigate the need of repeated attention computations, but brings significant memory overhead. Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks. In this paper, we analyze the attention weights distribution within code generation models via an empirical study, uncovering a sparsity pattern, i.e., the aggregation of information at specific anchor points. Based on this observation, we propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress the contextual information, and layer-wise anchor attention enabling cross-layer communication to mitigate the issue of excessive superposition caused by the compression. The extensive experiments across multiple benchmark datasets confirm the effectiveness of AnchorCoder, which can consistently achieve a significant (at least 70%) reduction in KV cache requirements, while preserving the majority of model's performance.

Anchor Attention, Small Cache: Code Generation with Large Language Models

TL;DR

A novel approach is proposed, AnchorCoder, which features token-wise anchor attention designed to extract and compress the contextual information, and layer-wise anchor attention enabling cross-layer communication to mitigate the issue of excessive superposition caused by the compression.

Abstract

The development of large language models (LLMs) has revolutionized automated code generation. However, their high demand of computation resources has hindered a broader deployment and raised environmental concerns. A common strategy for diminishing computational demands is to cache Key-Value (KV) states from the attention mechanism which is adopted predominately by mainstream LLMs. It can mitigate the need of repeated attention computations, but brings significant memory overhead. Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks. In this paper, we analyze the attention weights distribution within code generation models via an empirical study, uncovering a sparsity pattern, i.e., the aggregation of information at specific anchor points. Based on this observation, we propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress the contextual information, and layer-wise anchor attention enabling cross-layer communication to mitigate the issue of excessive superposition caused by the compression. The extensive experiments across multiple benchmark datasets confirm the effectiveness of AnchorCoder, which can consistently achieve a significant (at least 70%) reduction in KV cache requirements, while preserving the majority of model's performance.

Paper Structure

This paper contains 15 sections, 8 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: AnchorCoder's performance on context compression.
  • Figure 2: Sparsity of attention weights in code LLMs.
  • Figure 3: The attention heatmap of CodeLlama-7B.
  • Figure 4: The results of the "needle-in-a-haystack" experiment.
  • Figure 5: The upper part provides the attention heatmap deploying different KV cache compression methods. The lower part provides the illustration of AnchorCoder.
  • ...and 3 more figures