RelayAttention for Efficient Large Language Model Serving with Long System Prompts

Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau

TL;DR

The paper tackles the throughput and latency bottleneck in LLM serving caused by long shared system prompts during causal attention. It introduces RelayAttention, a mathematically reformulated attention algorithm that reads system KVs from DRAM only once per batch, eliminating redundant IO without retraining. The approach yields substantial end-to-end speedups in production systems (notably vLLM) and scales with longer prompts across multiple models and GPUs. This IO-aware method reduces hosting costs and enables longer prefixes or prefix-tuning in practical deployments. The work demonstrates both theoretical speedups and measured improvements for real-world LLM serving workloads.

Abstract

A practical large language model (LLM) service may involve a long system prompt, which specifies the instructions, examples, and knowledge documents of the task and is reused across requests. However, the long system prompt causes throughput/latency bottlenecks as the cost of generating the next token grows w.r.t. the sequence length. This paper aims to improve the efficiency of LLM services that involve long system prompts. Our key observation is that handling these system prompts requires heavily redundant memory accesses in existing causal attention computation algorithms. Specifically, for batched requests, the cached hidden states (i.e., key-value pairs) of system prompts are transferred from off-chip DRAM to on-chip SRAM multiple times, each corresponding to an individual request. To eliminate such a redundancy, we propose RelayAttention, an attention algorithm that allows reading these hidden states from DRAM exactly once for a batch of input tokens. RelayAttention is a free lunch: it maintains the generation quality while requiring no model retraining, as it is based on a mathematical reformulation of causal attention. We have observed significant performance improvements to a production-level system, vLLM, through integration with RelayAttention. The improvements are even more profound with longer system prompts.
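
The "mathematical reformulation" mentioned in the abstract can be illustrated with a small numerical sketch. The abstract does not give the formula, so the code below assumes the standard softmax decomposition: attention over the concatenated system and request-specific KVs equals a convex combination of attention over each part, weighted by their log-sum-exp terms. All function names (`attention_with_lse`, `relay_fusion`, `batched_relay_attention`) are hypothetical and chosen for illustration. The sketch only checks numerical equivalence. It does not model the IO behavior, which in a real kernel would come from computing the system part for the whole batch in one pass over the shared KVs.

```python
import math


def attention_with_lse(q, keys, values):
    """Softmax attention of one query vector over (keys, values).

    Returns the attention output and the log-sum-exp of the scaled scores,
    which is needed later to merge two partial attention results.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    out = [sum(e * v[d] for e, v in zip(exps, values)) / total
           for d in range(dim)]
    return out, m + math.log(total)


def relay_fusion(out_sys, lse_sys, out_ctx, lse_ctx):
    """Merge two partial attention outputs into the full-attention output.

    alpha is the share of softmax mass that falls on the system tokens:
    exp(lse_sys) / (exp(lse_sys) + exp(lse_ctx)).
    """
    m = max(lse_sys, lse_ctx)
    w_sys = math.exp(lse_sys - m)
    w_ctx = math.exp(lse_ctx - m)
    alpha = w_sys / (w_sys + w_ctx)
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(out_sys, out_ctx)]


def batched_relay_attention(queries, sys_k, sys_v, ctx_ks, ctx_vs):
    """Apply the decomposition to a batch that shares one set of system KVs.

    This Python loop touches sys_k/sys_v once per request; it is only a
    correctness sketch, not an IO-efficient kernel.
    """
    outputs = []
    for q, ctx_k, ctx_v in zip(queries, ctx_ks, ctx_vs):
        out_sys, lse_sys = attention_with_lse(q, sys_k, sys_v)
        out_ctx, lse_ctx = attention_with_lse(q, ctx_k, ctx_v)
        outputs.append(relay_fusion(out_sys, lse_sys, out_ctx, lse_ctx))
    return outputs
```

Under these assumptions, each output of `batched_relay_attention` should match, up to floating-point error, the result of calling `attention_with_lse` on the concatenated lists `sys_k + ctx_k` and `sys_v + ctx_v`. This is consistent with the claim that generation quality is preserved without retraining.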

Paper Structure

This paper contains 20 sections, 12 equations, 10 figures, 4 tables, 1 algorithm.

Figures (10)

  • Figure 1: Llama-30B attention inference latency w.r.t. system prompt length (A40 GPU, batch size 32). We set the length of (request-specific) contexts, which include user prompts and previously generated tokens, to 128.
  • Figure 2: A system prompt may include instructions, knowledge documents and few-shot examples. Here, we use the shopping assistant as an example application.
  • Figure 3: A decoding step during the autoregressive generation phase. On the right side, we provide a closer view of the attention computation with IO-awareness. Note that the floating-point operations are executed in the fast on-chip SRAM, while the KVs are cached in the slow off-chip DRAM. As highlighted with the dashed boxes and red arrows, (1) the computation mainly involves matrix-vector multiplications; and (2) while being shared by all requests, the system KVs are transferred from DRAM to SRAM multiple times, once for each request.
  • Figure 4: The computation of RelayAttention. It is a mathematical reformulation of causal attention in Figure 3, but loads the system KVs exactly once for a batch of requests (highlighted with red arrows).
  • Figure 5: Key modifications (highlighted in red at the bottom) to integrate RelayAttention into an existing LLM serving system (top).
  • ...and 5 more figures