CLAA: Cross-Layer Attention Aggregation for Accelerating LLM Prefill
Bradley McDanel, Steven Li, Harshit Khaitan
TL;DR
The paper tackles the prefill bottleneck in long-context LLM inference. It introduces an Answer-Informed Oracle that defines ground-truth token importance as the attention from the generated answer back to the prompt. The oracle reveals that token-ranking signals are unstable across layers, which motivates Cross-Layer Attention Aggregation (CLAA): defer compression past the early layers and, at pruning time, aggregate token scores across multiple layers with a max-over-layers scheme. CLAA closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared with the Full KV Cache baseline, while maintaining near-oracle accuracy on benchmarks such as LongBench, Needle-in-a-Haystack, and RULER. The approach is a practical, architecture-agnostic improvement for prefill acceleration in long-context LLMs, with demonstrated robustness across models and contexts and guidance on hyperparameters such as the aggregation window and the number of initial uncompressed layers.
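As a rough illustration of the max-over-layers idea, the sketch below takes per-layer token scores, keeps each token's maximum over the most recent `window` layers, and retains the top-k tokens. The function names, the toy scores, and the choice of top-k selection are assumptions made for this example, not the authors' implementation; the deferral of compression past the early layers is not modeled.

```python
def aggregate_max_over_layers(layer_scores, window):
    """Element-wise max of per-token scores over the last `window` layers.

    layer_scores: list of per-layer lists, one score per prompt token.
    (Hypothetical helper; the score definition is an assumption.)
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = layer_scores[-window:]
    # zip(*recent) yields, for each token, its scores across the recent layers.
    return [max(per_token) for per_token in zip(*recent)]


def select_top_k(scores, k):
    """Indices of the k highest-scoring tokens, returned in prompt order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy scores for 4 prompt tokens at 3 consecutive layers (made-up numbers).
layer_scores = [
    [0.10, 0.80, 0.05, 0.05],
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.75, 0.15],  # token 1 scores poorly at this layer only
]

single_layer_keep = select_top_k(aggregate_max_over_layers(layer_scores, window=1), k=2)
aggregated_keep = select_top_k(aggregate_max_over_layers(layer_scores, window=3), k=2)
print(single_layer_keep)  # tokens kept using only the last layer
print(aggregated_keep)    # tokens kept using the max over three layers
```

In this toy input, relying on the last layer alone drops token 1 even though an earlier layer scored it highest, while the max over the window keeps it. That is the kind of single-layer failure the aggregation is meant to absorb.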
Abstract
The prefill stage in long-context LLM inference remains a computational bottleneck. Recent token-ranking heuristics accelerate inference by selectively processing a subset of semantically relevant tokens. However, existing methods suffer from unstable token importance estimation, with estimates that often vary between layers. Moreover, evaluating token-ranking quality independently of heuristic-specific architectures is challenging. To address this, we introduce an Answer-Informed Oracle, which defines ground-truth token importance by measuring attention from generated answers back to the prompt. This oracle reveals that existing heuristics exhibit high variance across layers: rankings can degrade sharply at specific layers, a failure mode invisible to end-to-end benchmarks. The diagnosis suggests a simple fix: aggregate scores across layers rather than relying on any single one. We implement this as Cross-Layer Attention Aggregation (CLAA), which closes the gap to the oracle upper bound and reduces Time-to-First-Token (TTFT) by up to 39% compared to the Full KV Cache baseline.
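To make the oracle idea concrete, the following minimal sketch scores each prompt token by the attention it receives from answer tokens and then checks how well a heuristic's per-layer ranking agrees with that oracle. Averaging over answer tokens, the top-k overlap metric, and all names and numbers here are assumptions chosen for illustration; the abstract does not specify how the paper reduces attention or measures ranking quality.

```python
def oracle_importance(answer_to_prompt_attn):
    """Mean attention each prompt token receives from the answer tokens.

    answer_to_prompt_attn: one row per generated answer token, one column
    per prompt token. (Mean reduction is an assumption of this sketch.)
    """
    n_answer = len(answer_to_prompt_attn)
    return [sum(col) / n_answer for col in zip(*answer_to_prompt_attn)]


def top_k_set(scores, k):
    """Set of indices of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def top_k_overlap(heuristic_scores, oracle_scores, k):
    """Fraction of the oracle's top-k tokens that the heuristic also selects."""
    shared = top_k_set(heuristic_scores, k) & top_k_set(oracle_scores, k)
    return len(shared) / k


# Toy attention from 2 answer tokens to 4 prompt tokens (made-up numbers).
attn = [
    [0.125, 0.5, 0.25, 0.125],
    [0.0, 0.5, 0.25, 0.25],
]
oracle = oracle_importance(attn)

# Hypothetical heuristic scores at two different layers.
heuristic_by_layer = [
    [0.1, 0.6, 0.2, 0.1],  # agrees with the oracle's top-2
    [0.5, 0.1, 0.3, 0.1],  # ranks token 0 too high at this layer
]
per_layer_overlap = [top_k_overlap(h, oracle, k=2) for h in heuristic_by_layer]
print(per_layer_overlap)
```

Scoring each layer against the oracle separately is what exposes a layer whose ranking degrades, which an end-to-end benchmark score would average away.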
