CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill

Bradley McDanel, Steven Li, Harshit Khaitan

TL;DR

The paper tackles the prefill bottleneck in long-context LLM inference by introducing an Answer-Informed Oracle that defines ground-truth token importance via attention from the generated answer back to the prompt. The oracle reveals layer-wise instability in token-ranking signals, which motivates Cross-Layer Attention Aggregation (CLAA): aggregate signals across multiple layers, defer compression past the early layers, and apply a max-over-layer scheme at pruning time. CLAA closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared with the Full KV Cache baseline, while maintaining near-oracle accuracy across LongBench, Needle-in-a-Haystack, and RULER. The approach offers a practical, architecture-agnostic improvement for prefill acceleration in long-context LLMs, with demonstrated robustness across models and contexts and guidance on hyperparameters such as the aggregation window and the number of initial uncompressed layers.
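The max-over-layer idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `[layer][token]` score layout, and the rule of taking the last `window` layers are all assumptions made for illustration, and how per-layer scores are computed is left out entirely.

```python
from typing import List, Sequence


def aggregate_max_over_layers(
    layer_scores: Sequence[Sequence[float]], window: int
) -> List[float]:
    """Per-token maximum over the last `window` layers (hypothetical layout:
    layer_scores[layer][token])."""
    if not 1 <= window <= len(layer_scores):
        raise ValueError("window must be between 1 and the number of layers")
    recent = layer_scores[-window:]
    # zip(*recent) yields, for each token, its scores across the chosen layers.
    return [max(per_token) for per_token in zip(*recent)]


def select_kept_tokens(scores: Sequence[float], keep_rate: float) -> List[int]:
    """Indices of the highest-scoring tokens, returned in prompt order."""
    n_keep = max(1, round(keep_rate * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy scores for 3 layers and 5 tokens (made-up numbers).
layer_scores = [
    [0.1, 0.9, 0.2, 0.0, 0.3],
    [0.8, 0.1, 0.2, 0.0, 0.3],
    [0.1, 0.1, 0.2, 0.7, 0.05],
]
aggregated = aggregate_max_over_layers(layer_scores, window=3)
kept_aggregated = select_kept_tokens(aggregated, keep_rate=0.4)
kept_last_layer = select_kept_tokens(
    aggregate_max_over_layers(layer_scores, window=1), keep_rate=0.4
)
```

In this toy input the last layer alone ranks tokens 0 and 1 low even though earlier layers score them highly, so a single-layer selection drops them while the aggregated selection keeps them. That is the kind of layer-specific degradation the aggregation is meant to smooth over.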

Abstract

The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
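As a rough illustration of the oracle, the sketch below scores each prompt token by the attention it receives from the generated answer tokens. The abstract only says that attention is measured from the answer back to the prompt. The uniform mean over layers, heads, and answer positions, the nested-list layout `attn[layer][head][answer_token][prompt_token]`, and the function names are assumptions, not details taken from the paper.

```python
from typing import List, Sequence

# Hypothetical layout: attn[layer][head][answer_token][prompt_token]
Attention = Sequence[Sequence[Sequence[Sequence[float]]]]


def oracle_importance(attn: Attention) -> List[float]:
    """Mean attention each prompt token receives from the answer tokens,
    averaged uniformly over layers and heads (an assumed aggregation)."""
    prompt_len = len(attn[0][0][0])
    totals = [0.0] * prompt_len
    rows = 0
    for layer in attn:
        for head in layer:
            for answer_row in head:
                for j, weight in enumerate(answer_row):
                    totals[j] += weight
                rows += 1
    return [total / rows for total in totals]


def oracle_ranking(importance: Sequence[float]) -> List[int]:
    """Prompt-token indices ordered from most to least important."""
    return sorted(range(len(importance)), key=lambda j: importance[j], reverse=True)


# Toy input: 2 layers, 1 head, 2 answer tokens, 3 prompt tokens (made-up weights).
attn = [
    [[[0.1, 0.7, 0.2], [0.2, 0.6, 0.2]]],
    [[[0.3, 0.3, 0.4], [0.0, 0.8, 0.2]]],
]
importance = oracle_importance(attn)
ranking = oracle_ranking(importance)
```

Because the oracle needs the generated answer, it can only be computed after the fact. It serves as an evaluation reference for heuristics, not as something available during prefill.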

Paper Structure

This paper contains 39 sections, 5 equations, 13 figures, 5 tables, 1 algorithm.

Figures (13)

  • Figure 1: Illustration of our framework for evaluating token ranking heuristics for LLM Prefill acceleration. An Answer-Informed Oracle establishes a ground-truth token ranking by aggregating attention from the generated answer back to the prompt. This approach, which measures rank similarity between heuristic outputs and the oracle, motivates our Cross-Layer Attention Aggregation (CLAA) method that achieves higher alignment with the oracle.
  • Figure 2: Layer-wise token ranking performance on Llama-3.1-8B-Instruct. Spearman correlation with answer-informed oracle across LongBench tasks, comparing existing heuristics to our proposed CLAA method.
  • Figure 3: Needle-in-a-Haystack results for Llama-3.1-8B-Instruct with a 40% token keep rate. X denotes out of memory on an 80GB A100.
  • Figure 4: LongBench accuracy versus Time-to-First-Token (TTFT) for Llama-3.1-8B-Instruct on a 10k-token sequence. Points correspond to 10%, 20%, and 40% keep rates.
  • Figure 5: End-to-end performance breakdown for a 10k token prompt and 32 token generation. Bars show Prefill (TTFT) and Decode time. Annotations indicate decode throughput (tokens per second) and KV cache size (GB) at the start of decode.
  • ...and 8 more figures
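
The rank-similarity evaluation named in the captions for Figures 1 and 2 (Spearman correlation between a heuristic's token scores and the oracle's) can be sketched as follows. This is a generic Spearman computation with average ranks for ties, written with the standard library only. The function names and toy scores are illustrative assumptions, and the paper's exact evaluation protocol may differ.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, shifted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy per-token scores (made-up numbers): one heuristic agrees with the
# oracle's ordering, the other reverses it.
oracle_scores = [0.05, 0.40, 0.15, 0.30]
agreeing_heuristic = [0.1, 0.9, 0.2, 0.5]
reversed_heuristic = [0.9, 0.1, 0.5, 0.2]
rho_agree = spearman(agreeing_heuristic, oracle_scores)
rho_reversed = spearman(reversed_heuristic, oracle_scores)
```

Computing this value per layer, as Figure 2 does, is what exposes layers where a heuristic's ranking diverges from the oracle even when end-to-end accuracy looks acceptable.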