CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill

Bradley McDanel; Steven Li; Harshit Khaitan

TL;DR

The paper tackles the prefill bottleneck in long-context LLM inference. It introduces an Answer-Informed Oracle that defines ground-truth token importance as the attention the generated answer pays to each prompt token. The oracle reveals that token-ranking signals are unstable from layer to layer, which motivates Cross-Layer Attention Aggregation (CLAA): defer compression past the earliest layers, aggregate ranking signals across multiple layers, and apply a max-over-layer scheme at pruning time. CLAA closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% relative to the Full KV Cache baseline, while maintaining near-oracle accuracy on LongBench, Needle-in-a-Haystack, and RULER. The approach is practical and architecture-agnostic, is robust across models and context lengths, and comes with guidance on two hyperparameters: the aggregation window and the number of initial uncompressed layers.
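The max-over-layer idea can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy scores, and the reading of the aggregation window as "the last `window` layers seen before pruning" are assumptions made for illustration, and how each layer produces its per-token scores is left out entirely.

```python
from typing import List


def aggregate_max_over_layers(layer_scores: List[List[float]], window: int) -> List[float]:
    """Combine per-token scores from the last `window` layers by taking the max.

    layer_scores[l][t] is a hypothetical importance score of prompt token t at layer l.
    """
    if window < 1 or window > len(layer_scores):
        raise ValueError("window must be between 1 and the number of layers")
    recent = layer_scores[-window:]
    # zip(*recent) groups the scores of one token across the selected layers.
    return [max(per_token) for per_token in zip(*recent)]


def select_tokens(scores: List[float], keep_rate: float) -> List[int]:
    """Return indices of the top-scoring tokens, restored to prompt order."""
    k = max(1, int(len(scores) * keep_rate))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy scores for 5 prompt tokens at 3 layers (made-up numbers).
layer_scores = [
    [0.1, 0.7, 0.1, 0.2, 0.3],
    [0.2, 0.1, 0.1, 0.6, 0.3],  # token 1 is under-ranked at this layer
    [0.1, 0.6, 0.2, 0.5, 0.1],
]

agg = aggregate_max_over_layers(layer_scores, window=3)
kept = select_tokens(agg, keep_rate=0.4)
single_layer_kept = select_tokens(layer_scores[1], keep_rate=0.4)

print(kept)               # [1, 3]
print(single_layer_kept)  # [3, 4]
```

In this toy case, pruning from the middle layer alone would drop token 1, while taking the max over the window keeps it, because a token only needs to score highly at one layer in the window to survive.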

Abstract

The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, often varying between layers. Evaluating token-ranking quality independently from heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
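A rough sketch of how a heuristic's ranking could be scored against such an oracle follows. The details are assumptions rather than the paper's procedure: the oracle here simply sums answer-to-prompt attention over answer tokens (the paper's exact aggregation over heads and layers is not specified in this summary), the helper names are hypothetical, and the numbers are made up. Spearman correlation is used because the figure captions report it as the rank-similarity measure.

```python
from typing import List


def oracle_importance(answer_to_prompt_attn: List[List[float]]) -> List[float]:
    """Sum the attention each prompt token receives from all answer tokens.

    Rows are answer tokens, columns are prompt tokens (assumed layout).
    """
    return [sum(col) for col in zip(*answer_to_prompt_attn)]


def ranks(values: List[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


# Two answer tokens attending to four prompt tokens (toy values).
attn = [
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.5, 0.0, 0.3],
]
oracle = oracle_importance(attn)

stable_layer = [0.2, 0.9, 0.05, 0.4]   # same ordering as the oracle
degraded_layer = [0.9, 0.1, 0.4, 0.2]  # ordering mostly inverted

rho_stable = spearman(oracle, stable_layer)
rho_degraded = spearman(oracle, degraded_layer)
print(rho_stable, rho_degraded)
```

Computing this correlation once per layer is what would expose a layer whose ranking degrades sharply, even if end-to-end accuracy hides it.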

Paper Structure (39 sections, 5 equations, 13 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Approaches to Prefill Acceleration
Token Ranking Strategies in Prefill Acceleration
GemFilter.
FastKV.
Speculative Prefill.
KV Cache Management in Prefill Acceleration
Compression via Sequence Pruning.
Layer-wise Cache Compression.
Answer-Informed Oracle Framework
Oracle Construction
Oracle as an Upper-Bound Benchmark
Cross-Layer Attention Aggregation (CLAA)
The Oracle Reveals Layer-wise Instability
...and 24 more sections

Figures (13)

Figure 1: Illustration of our framework for evaluating token ranking heuristics for LLM Prefill acceleration. An Answer-Informed Oracle establishes a ground-truth token ranking by aggregating attention from the generated answer back to the prompt. This approach, which measures rank similarity between heuristic outputs and the oracle, motivates our Cross-Layer Attention Aggregation (CLAA) method that achieves higher alignment with the oracle.
Figure 2: Layer-wise token ranking performance on Llama-3.1-8B-Instruct. Spearman correlation with the Answer-Informed Oracle across LongBench tasks, comparing existing heuristics to our proposed CLAA method.
Figure 3: Needle-in-a-Haystack results for Llama-3.1-8B-Instruct at a 40% token keep rate. X denotes out-of-memory on an 80GB A100.
Figure 4: LongBench accuracy versus Time-to-First-Token (TTFT) for Llama-3.1-8B-Instruct on a 10k-token sequence. Points correspond to 10%, 20%, and 40% keep rates.
Figure 5: End-to-end performance breakdown for a 10k-token prompt and 32-token generation. Bars show Prefill (TTFT) and Decode time. Annotations indicate decode throughput (tokens per second) and KV cache size (GB) at the start of decode.
...and 8 more figures
