Star Attention: Efficient LLM Inference over Long Sequences

Shantanu Acharya, Fei Jia, Boris Ginsburg

TL;DR

Star Attention tackles the quadratic self-attention bottleneck for long sequences with a two-phase block-sparse scheme that distributes the context across multiple hosts and performs a global attention pass during token generation. Phase 1 encodes the long context blockwise, using anchor blocks to approximate global attention with linear complexity, while Phase 2 uses a distributed global softmax to generate tokens and update KV caches. The approach achieves speedups of up to 11x (and up to 16.9x at 1M tokens) while retaining 97–100% of the accuracy of full global attention on several LLMs, and it generalizes across long-context benchmarks such as RULER, BABILong, and InfiniteBench. It works with pretrained models without fine-tuning and can be combined with Flash Attention for further acceleration.

Abstract

Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 97-100% of accuracy.
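
The second phase relies on combining attention computed separately on each host into one sequence-global result. A minimal sketch of that kind of aggregation is given below, assuming each host returns its locally normalized output together with the log-sum-exp of its local scores; this is the standard log-sum-exp merge, not necessarily the paper's exact procedure. The names `local_attention` and `merge_partials`, and the toy data, are hypothetical and chosen only for illustration.

```python
import math


def local_attention(q, keys, values):
    """Attend one query vector over a single host's KV shard.

    Returns (output, lse): the locally normalized output vector and the
    log-sum-exp of the local attention scores.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def merge_partials(partials):
    """Combine per-host (output, lse) pairs into the global attention output."""
    global_max = max(lse for _, lse in partials)
    # Each host's share of the global softmax denominator.
    coeffs = [math.exp(lse - global_max) for _, lse in partials]
    norm = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / norm
            for d in range(dim)]


# Toy data (assumed): 6 cached tokens, split evenly across 3 hypothetical hosts.
q = [0.5, -1.0, 0.25, 2.0]
keys = [[0.1 * (i + j) for j in range(4)] for i in range(6)]
values = [[float(i), float(i * i)] for i in range(6)]

full, _ = local_attention(q, keys, values)  # reference: all tokens on one host
shards = [(keys[i:i + 2], values[i:i + 2]) for i in range(0, 6, 2)]
merged = merge_partials([local_attention(q, k, v) for k, v in shards])
print(full, merged)
```

In this sketch only one output vector and one scalar per host need to be exchanged for each query token, which is consistent with the low communication overhead described above.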

Paper Structure

This paper contains 21 sections, 7 equations, 8 figures, 10 tables, and 2 algorithms.

Figures (8)

  • Figure 1: Star Attention inference flow across two phases. (a) Context Encoding: The input context is partitioned into blocks and distributed across hosts, where each block (except the first) is prefixed with the anchor block ($c_1$). Each host processes its assigned block and stores the non-anchor portion of the KV cache. (b) Query Encoding and Token Generation: The query is broadcast to all hosts, which compute local attention using cached KVs. A designated "query" host then aggregates softmax normalization statistics to compute global attention and generates the next token.
  • Figure 2: Block sparsity pattern in Star Attention for a sequence partitioned into 5 context blocks $c_i$ and a query block $q$. Each context block attends only to itself and the "anchor block" whereas the query attends to the entire input.
  • Figure 3: Attention distribution across the sequence during context encoding under different strategies in Phase 1. (a) Global attention exhibits a single attention sink at the sequence start. (b) Without anchor blocks, blockwise context encoding creates multiple attention sinks at the start of each block. (c) With anchor blocks, attention sinks shift to anchor tokens, yielding a distribution that closely approximates global attention. The sequence is 4K tokens long and partitioned into 512-token chunks.
  • Figure 4: Accuracy comparison of Star Attention and Global Attention on RULER and BABILong from 16K to 128K sequence lengths using various models. All runs use block and anchor block sizes set to one-quarter of the total sequence length. Star Attention maintains 97-100% of the accuracy of global attention, and in some cases even outperforms it.
  • Figure 5: Impact of context and anchor block sizes on the accuracy of Star Attention at 128K sequence length with Llama-3.1-8B Instruct. (a) Accuracy as a function of context block size, with anchor block size matched to it. (b) Accuracy as a function of anchor block size, with context block size fixed at 32K. Larger block sizes yield consistent accuracy improvements, highlighting the benefit of broader receptive fields for long-context understanding.
  • ...and 3 more figures
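
The block sparsity pattern described in the Figure 2 caption can be sketched at block granularity as follows. This is an illustrative reconstruction from the caption only: the function names `star_block_mask` and `render` are hypothetical, and the causal masking that would apply between tokens inside each allowed block is not modeled here.

```python
def star_block_mask(num_context_blocks):
    """Block-level attention mask: rows are attending blocks, columns are attended blocks.

    Rows/columns 0..num_context_blocks-1 are the context blocks c_1..c_n;
    the last row/column is the query block q.
    """
    if num_context_blocks < 1:
        raise ValueError("need at least one context block")
    n = num_context_blocks + 1
    mask = [[False] * n for _ in range(n)]
    for i in range(num_context_blocks):
        mask[i][0] = True  # every context block sees the anchor block c_1
        mask[i][i] = True  # ...and itself
    mask[n - 1] = [True] * n  # the query attends to the entire input
    return mask


def render(mask):
    """Draw the mask as text, with 'x' for allowed block pairs."""
    labels = [f"c{i + 1}" for i in range(len(mask) - 1)] + ["q"]
    return "\n".join(
        f"{label:>2} " + " ".join("x" if allowed else "." for allowed in row)
        for label, row in zip(labels, mask)
    )


mask = star_block_mask(5)  # 5 context blocks plus a query block, as in Figure 2
print(render(mask))
```

With 5 context blocks, 15 of the 36 block pairs are active, compared with 21 for a block-level causal (lower-triangular) pattern over the same 6 blocks.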