A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention

Heejun Lee, Geon Park, Youngwan Lee, Jaduk Suh, Jina Kim, Wonyoung Jeong, Bumsik Kim, Hyemin Lee, Myeongjae Jeon, Sung Ju Hwang

TL;DR

The paper addresses the bottleneck of quadratic attention in long-context transformers by introducing HiP, a training-free Hierarchically Pruned Attention that achieves $O(T \log T)$ time and $O(T)$ space through a locality-driven top-$k$ estimation. It combines a tree-like top-$k$ mask estimation with block-wise MMU-friendly tiling and a KV-cache offloading system to extend GPU-context capacity to tens of thousands of tokens while preserving generation quality. Empirical results on Llama3.1-8B across PG19, LongBench, and passkey tasks demonstrate substantial speedups in prefill and decoding, with end-to-end decoding gains up to $6.83\times$ at 128k context and context extensions up to 64k–512k via offloading. The method remains training-free and plug-and-play, promising practical deployment of ultra-long-context LLMs on commodity hardware for applications like long document QA, multi-agent chats, and retrieval-augmented reasoning.

Abstract

In modern large language models (LLMs), increasing the context length is crucial for improving comprehension and coherence in long-context, multi-modal, and retrieval-augmented language generation. While many recent transformer models attempt to extend their context length over a million tokens, they remain impractical due to the quadratic time and space complexities. Although recent works on linear and sparse attention mechanisms can achieve this goal, their real-world applicability is often limited by the need to re-train from scratch and significantly worse performance. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We notice a pattern in the attention scores of pretrained LLMs where tokens close together tend to have similar scores, which we call "attention locality". Based on this observation, we utilize a novel tree-search-like algorithm that estimates the top-$k$ key tokens for a given query on the fly, which is mathematically guaranteed to have better performance than random attention pruning. In addition to improving the time complexity of the attention mechanism, we further optimize GPU memory usage by implementing KV cache offloading, which stores only $O(\log T)$ tokens on the GPU while maintaining similar decoding throughput. Experiments on benchmarks show that HiP, with its training-free nature, significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation. HiP enables pretrained LLMs to scale up to millions of tokens on commodity GPUs, potentially unlocking long-context LLM applications previously deemed infeasible.

Paper Structure

This paper contains 52 sections, 9 theorems, 80 equations, 31 figures, 18 tables, 2 algorithms.

Key Result

Theorem 1

Consider the case of finding the location of the top-$1$ key token with the maximum attention score in a context of $T$ tokens. Suppose that our locality assumption holds true. We divide the context into two branches with $T/2$ keys each. Then, the branch whose center token has the bigger attention
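The branch-selection step that the theorem sets up can be illustrated with a minimal Python sketch. It assumes the search keeps recursing into whichever half has the higher-scoring center token until a single key remains. The function name estimate_top1 and the toy score profile are illustrative assumptions, not taken from the paper, and the full score list is passed in only so the result can be compared with the exact maximum.

```python
def estimate_top1(scores):
    """Estimate the index of the maximum score by repeated branch selection.

    `scores` stands in for one query's attention scores over T keys.
    Only the two branch-center scores are read per iteration.
    Returns (estimated_index, number_of_scores_read).
    """
    lo, hi = 0, len(scores)            # current candidate range [lo, hi)
    probes = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_center = (lo + mid) // 2   # representative of branch [lo, mid)
        right_center = (mid + hi) // 2  # representative of branch [mid, hi)
        probes += 2
        if scores[left_center] >= scores[right_center]:
            hi = mid                    # keep the left branch
        else:
            lo = mid                    # keep the right branch
    return lo, probes


# Toy profile (an assumption): smooth, single-peaked scores with the peak at 37.
scores = [-(i - 37) ** 2 for i in range(64)]
print(estimate_top1(scores))  # (37, 12)
```

On this toy profile the sketch reads 12 of the 64 scores, two per halving step. When scores are not locally similar, the center token can misrepresent its branch and the estimate may miss the true maximum, which is why the locality assumption matters.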

Figures (31)

  • Figure 1: HiP Attention. HiP dynamically prunes block sparse attention depending on a given query token in sub-quadratic cost by utilizing the hierarchy and locality of natural language.
  • Figure 1: Passkey Results. We evaluate our proposed HiP and baselines using passkey retrieval, a needle-in-a-haystack-style context utilization benchmark.
  • Figure 2: Overview of our HiP attention mechanism. In HiP, the model dynamically decides which $k$ key tokens to attend to for each query by generating a sparse attention mask. The sparse attention mask is generated in a tree-search-like manner. At each iteration, the top-$k$ blocks with the largest attention scores are selected, and the remaining branches are discarded. The final mask becomes an accurate approximation of the top-$k$ blocks of the true attention map. Please refer to \ref{fig:appendix_flow} for a more detailed illustration.
  • Figure 2: RULER Results. We compare the effective context lengths of HiP and baselines with Llama3.1-8B. Accuracies surpassing 80% are marked with bold font.
  • Figure 3: Flow of KV Cache Offloading with HiP.
  • ...and 26 more figures
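
The iterative mask estimation described in the Figure 2 caption can be sketched for a single query as follows. This is a simplified, token-level illustration under stated assumptions: each branch is scored by its center key only, the context length is $k$ times a power of two, and the function name hip_topk_sketch is hypothetical. It does not model the block-wise GPU tiling or the KV cache offloading described in the paper.

```python
import numpy as np


def hip_topk_sketch(q, K, k):
    """Estimate k high-scoring key indices for query q by hierarchical pruning.

    q: (d,) query vector, K: (T, d) key matrix.
    Assumes T == k * 2**n so every block splits evenly (a simplification).
    """
    T = K.shape[0]
    size = T // k
    assert size * k == T and (size & (size - 1)) == 0, "T must be k * 2**n"
    starts = np.arange(k) * size          # k equal blocks covering the context
    while size > 1:
        size //= 2
        # Split every kept block into two halves: 2k candidate branches.
        cand = np.concatenate([starts, starts + size])
        # Score each branch using only its center (representative) key.
        centers = cand + size // 2
        branch_scores = K[centers] @ q
        # Keep the k best branches and discard the rest.
        keep = np.argsort(branch_scores)[-k:]
        starts = np.sort(cand[keep])
    return starts                          # block size 1: these are key indices


# Toy data (an assumption): 1-D keys whose scores peak smoothly near index 37.3.
T, k = 64, 4
K = -((np.arange(T) - 37.3) ** 2).reshape(T, 1)
q = np.array([1.0])
print(hip_topk_sketch(q, K, k))            # [36 37 38 39]
print(np.sort(np.argsort(K @ q)[-k:]))     # exact top-k for comparison
```

Each iteration scores $2k$ candidates and there are $\log_2(T/k)$ iterations, so the per-query work in this sketch grows logarithmically in $T$ for fixed $k$, in line with the $O(T \log T)$ total cost stated in the abstract.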

Theorems & Definitions (27)

  • Theorem 1 (Informal)
  • Lemma 1
  • Claim 1
  • Claim 2
  • Claim 3
  • Claim 4
  • Lemma 2
  • Proof (sketch)
  • Lemma 3
  • Proof (sketch)
  • ...and 17 more