Star Attention: Efficient LLM Inference over Long Sequences
Shantanu Acharya, Fei Jia, Boris Ginsburg
TL;DR
Star Attention addresses the quadratic cost of self-attention on long sequences with a two-phase block-sparse scheme that shards the context across multiple hosts. In Phase 1, each host encodes its share of the context with blockwise-local attention, using anchor blocks to approximate global attention at a cost linear in context length. In Phase 2, query and generated tokens attend to all cached tokens through a distributed global softmax, and the KV cache is updated as tokens are produced. The authors report speedups of up to 11x (up to 16.9x at 1M tokens) while retaining 97–100% of the accuracy of full global attention on several LLMs, and the results hold across long-context benchmarks such as RULER, BABILong, and InfiniteBench. The method works with pretrained models without fine-tuning and can be combined with Flash Attention for further acceleration.
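The "distributed global softmax" of Phase 2 can be illustrated with a standard log-sum-exp merge: each host attends over its own KV shard, returns a locally normalised output plus the log of its softmax denominator, and the partial results are reweighted into the exact global answer. The sketch below is a minimal single-process illustration, not the authors' implementation; the function names `local_attention` and `merge_hosts`, the toy tensors, and the two-shard split are all assumptions made for the example.

```python
import math

def local_attention(q, keys, values):
    """Attention of one query over one host's KV shard (hypothetical helper).

    Returns the locally normalised output vector and the log-sum-exp of the
    scaled scores, which is all another host needs to merge results later.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)

def merge_hosts(partials):
    """Combine (output, log-sum-exp) pairs from all hosts into the global output."""
    g = max(lse for _, lse in partials)
    coeffs = [math.exp(lse - g) for _, lse in partials]  # relative softmax mass per host
    norm = sum(coeffs)
    dim = len(partials[0][0])
    return [sum(c * out[d] for c, (out, _) in zip(coeffs, partials)) / norm
            for d in range(dim)]

# Toy data: four cached tokens split across two "hosts".
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]
shards = [(keys[:2], values[:2]), (keys[2:], values[2:])]

merged = merge_hosts([local_attention(q, k, v) for k, v in shards])
reference, _ = local_attention(q, keys, values)      # attention over the unsharded cache
```

Here `merged` matches `reference` up to floating-point error, which is why only one output vector and one scalar per host need to be communicated for each query token.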
Abstract
Inference with Transformer-based Large Language Models (LLMs) on long sequences is both costly and slow due to the quadratic complexity of the self-attention mechanism. We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts while minimizing communication overhead. In the first phase, the context is processed using blockwise-local attention across hosts, in parallel. In the second phase, query and response tokens attend to all prior cached tokens through sequence-global attention. Star Attention integrates seamlessly with most Transformer-based LLMs trained with global attention, reducing memory requirements and inference time by up to 11x while preserving 97-100% of accuracy.
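To make the cost of the first phase concrete, the following sketch counts how many query-key pairs are evaluated when context tokens attend only within their own block, optionally preceded by an anchor block, and compares that with full causal attention. It is an illustration under stated assumptions rather than the paper's code: the anchor is assumed to be the first context block, and the names `phase1_visible_keys`, `count_pairs`, and `count_pairs_full` as well as the block size are hypothetical.

```python
def phase1_visible_keys(pos, block_size, use_anchor=True):
    """Key positions a context token at `pos` may attend to in the first phase."""
    block = pos // block_size
    start = block * block_size
    local = list(range(start, pos + 1))       # causal attention inside its own block
    if use_anchor and block > 0:
        # Assumption: the anchor is the entire first context block.
        return list(range(block_size)) + local
    return local

def count_pairs(n_tokens, block_size, use_anchor=True):
    """Total query-key pairs evaluated over the context in the first phase."""
    return sum(len(phase1_visible_keys(p, block_size, use_anchor))
               for p in range(n_tokens))

def count_pairs_full(n_tokens):
    """Query-key pairs under ordinary causal (global) attention."""
    return n_tokens * (n_tokens + 1) // 2

block_size = 4
small = (count_pairs(64, block_size), count_pairs_full(64))     # (400, 2080)
large = (count_pairs(128, block_size), count_pairs_full(128))   # (816, 8256)
```

With a fixed block size, doubling the context roughly doubles the blockwise count (400 to 816) while the full-attention count roughly quadruples (2080 to 8256), which is the linear-versus-quadratic gap the first phase exploits.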
