LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han

TL;DR

LServe tackles the bottlenecks of long-context LLM serving by unifying block-sparse attention into a single framework that accelerates both prefilling and decoding. It combines static sparsity (streaming heads) with dynamic sparsity (hierarchical KV page selection), supported by fused sparse attention kernels and a reusable page selector that keeps selection overhead low. Empirical results show up to 2.9x prefilling and 1.3–2.1x decoding speedups over vLLM across multiple models and GPUs, while preserving long-context accuracy on benchmarks such as LongBench and NIAH. The approach is a practical, end-to-end improvement for long-context reasoning and document analysis at scale.

Abstract

Large language models (LLMs) have shown remarkable potential in processing long sequences and complex reasoning tasks, yet efficiently serving these models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context and reasoning capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.
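
As a rough illustration of query-centric page selection under a fixed page budget, the following Python sketch scores each KV page by an upper bound on the query-key dot product, computed from per-channel minimum and maximum key values, and keeps only the top-scoring pages. The scoring rule, the function name `select_pages`, and the parameters `page_size` and `budget` are assumptions made for illustration; the abstract does not specify them, and this sketch is flat rather than hierarchical.

```python
def select_pages(query, keys, page_size, budget):
    """Return the sorted indices of the `budget` highest-scoring KV pages.

    Hypothetical sketch: each page is summarized by per-channel min/max keys,
    and its score is an upper bound on q.k for any key stored in that page.
    """
    num_pages = (len(keys) + page_size - 1) // page_size
    scores = []
    for p in range(num_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        score = 0.0
        for c, q in enumerate(query):
            k_min = min(k[c] for k in page)
            k_max = max(k[c] for k in page)
            # The larger of the two products bounds q[c] * k[c] within the page.
            score += max(q * k_min, q * k_max)
        scores.append(score)
    # Rank pages by score; the budget is constant, independent of context length.
    ranked = sorted(range(num_pages), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:budget])


# Toy example: 16 cached tokens, 4 tokens per page, one key aligned with the query.
query = [1.0, -1.0]
keys = [[0.0, 0.0] for _ in range(16)]
keys[9] = [3.0, -3.0]  # token 9 lives in page 2
selected = select_pages(query, keys, page_size=4, budget=2)
```

In this toy setup the page holding the aligned key is always among the selected pages, and the number of pages loaded stays at `budget` however long the cached context grows.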

Paper Structure

This paper contains 44 sections, 2 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: LServe is an efficient system for serving long-sequence LLMs that leverages hybrid sparse attention. By unifying different sparsity patterns and applying KV cache quantization, LServe achieves significant speedups in both the prefilling and decoding stages while also reducing memory consumption.
  • Figure 2: Latency breakdown of LLM inference during the prefilling and decoding stages. Attention dominates both stages as sequence length increases, due to its quadratic complexity in prefilling and linear complexity in decoding. GEMM exhibits linear complexity in prefilling and constant complexity in decoding. Measurements were obtained with Llama-3-8B on an NVIDIA A100 GPU.
  • Figure 3: Attention calculation on GPUs: In both the decoding and prefilling stages, each query token iterates over all key and value tokens sequentially in a block-by-block manner. Skipping KV blocks reduces the number of sequential iterations, directly accelerating attention.
  • Figure 4: Unified block sparse attention pattern. LServe integrates various sparsity patterns into a unified framework.
  • Figure 5: LServe system overview. In the prefilling stage, LServe processes both dense heads and streaming heads within a fused sparse attention kernel. Past keys and values are stored in two separate paging systems: one for streaming heads and the other for dense heads. In the decoding stage, LServe applies dynamic sparsity to dense heads through a page selection procedure, and only the selected KV pages are loaded for decoding-stage attention. Normalization layers and residual connections are omitted from this figure for simplicity.
  • ...and 11 more figures
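
The block-skipping idea described in the captions for Figures 3 to 5 can be sketched in plain Python for a single decoding query. This is a minimal sketch under stated assumptions, not the LServe kernel: the function names, the sink-plus-local block pattern for streaming heads, and the default block counts are all hypothetical, and the loop stands in for what a fused GPU kernel would do over KV blocks.

```python
import math


def streaming_blocks(num_blocks, sink_blocks=1, local_blocks=2):
    """KV blocks kept by a streaming head: the first few and the most recent."""
    keep = set(range(min(sink_blocks, num_blocks)))
    keep.update(range(max(0, num_blocks - local_blocks), num_blocks))
    return sorted(keep)


def block_sparse_attention(query, keys, values, block_size, kept_blocks):
    """Attention for one query that iterates only over the kept KV blocks.

    Uses an online softmax, so the result equals a softmax restricted to the
    tokens in `kept_blocks`. Assumes `kept_blocks` is non-empty.
    """
    scale = 1.0 / math.sqrt(len(query))
    running_max = -math.inf
    denom = 0.0
    out = [0.0] * len(values[0])
    for b in kept_blocks:  # skipped blocks cost no iterations at all
        start = b * block_size
        stop = min(start + block_size, len(keys))
        for t in range(start, stop):
            score = scale * sum(q * k for q, k in zip(query, keys[t]))
            new_max = max(running_max, score)
            # Rescale the running sums whenever the maximum score changes.
            correction = math.exp(running_max - new_max)
            weight = math.exp(score - new_max)
            denom = denom * correction + weight
            out = [o * correction + weight * v for o, v in zip(out, values[t])]
            running_max = new_max
    return [o / denom for o in out]


# Toy example: 8 cached tokens in 4 blocks of 2 tokens each.
keys = [[float(i % 3), float(i % 2)] for i in range(8)]
values = [[float(i)] for i in range(8)]
query = [0.5, -0.25]
dense_out = block_sparse_attention(query, keys, values, 2, list(range(4)))
stream_out = block_sparse_attention(query, keys, values, 2, streaming_blocks(4, 1, 1))
```

With every block kept, the loop reproduces dense attention; with the streaming pattern it visits two of the four blocks, which is where the reduction in sequential iterations comes from.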