Hydragen: High-Throughput LLM Inference with Shared Prefixes

Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y. Fu, Christopher Ré, Azalia Mirhoseini

TL;DR

Hydragen delivers an exact, hardware-aware approach to attention when batched sequences share a common prefix. By decomposing attention into shared-prefix and unique-suffix components and performing inter-sequence batching for the prefix, it replaces many matrix-vector products with fewer matrix-matrix products, enabling tensor-core utilization. Empirical results show up to 32x end-to-end throughput gains over strong baselines, with modest sensitivity to prefix length and suffix size and strong gains in long-context and hierarchical-sharing scenarios. The method generalizes to tree-like sharing patterns and is implemented in PyTorch without custom kernels, suggesting broad portability and practical impact for large-scale LLM inference environments.
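The decomposition mentioned above rests on a standard softmax identity, sketched here as a worked equation. The notation is an assumption for illustration and may not match the paper's: $K_1, V_1$ denote the shared-prefix keys and values, $K_2, V_2$ the suffix keys and values, $\mathrm{SDP}$ ordinary scaled dot-product attention, and $\mathrm{LSE}$ the row-wise log-sum-exp of the scaled scores.

```latex
% Assumed notation: K = [K_1; K_2] and V = [V_1; V_2], concatenated along the sequence axis.
\mathrm{LSE}(Q, K) = \log \sum_{j} \exp\!\left(\frac{Q K_j^{\top}}{\sqrt{d}}\right)
\quad \text{(computed independently for each query row)}

\mathrm{SDP}(Q, K, V) =
\frac{\mathrm{SDP}(Q, K_1, V_1)\, e^{\mathrm{LSE}(Q, K_1)}
    + \mathrm{SDP}(Q, K_2, V_2)\, e^{\mathrm{LSE}(Q, K_2)}}
     {e^{\mathrm{LSE}(Q, K_1)} + e^{\mathrm{LSE}(Q, K_2)}}
```

Each term $e^{\mathrm{LSE}(Q, K_i)}$ is the softmax denominator restricted to one block of keys, so multiplying a partial result by it recovers that block's unnormalized sum, and dividing by the combined denominator renormalizes over all keys. This is why the method can be exact rather than approximate.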

Abstract

Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end CodeLlama-13b throughput by up to 32x against competitive baselines, with speedup growing with the batch size and shared prefix length. Hydragen also enables the use of very long shared contexts: with a large batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.
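The prefix/suffix split described in the abstract can be checked numerically. The following is a minimal sketch in plain Python, not the authors' PyTorch implementation: all function names (`attention_with_lse`, `merge`, `decomposed_attention`, `baseline_attention`) and the toy sizes are hypothetical, chosen only for illustration. It computes attention over a shared prefix once per batch, attention over each sequence's own suffix separately, and merges the two using their log-sum-exp values.

```python
import math
import random

def attention_with_lse(q, keys, values):
    """Softmax attention for one query; also returns the log-sum-exp of its scores."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]
    return out, m + math.log(total)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention outputs, weighting each by its softmax denominator."""
    m = max(lse_a, lse_b)
    wa, wb = math.exp(lse_a - m), math.exp(lse_b - m)
    return [(wa * a + wb * b) / (wa + wb) for a, b in zip(out_a, out_b)]

def decomposed_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs):
    # Prefix pass: every query attends to the single shared copy of the prefix KV.
    # In a tensor library, stacking the queries would turn this loop into one
    # matrix-matrix product instead of one matrix-vector product per sequence.
    prefix_parts = [attention_with_lse(q, prefix_k, prefix_v) for q in queries]
    results = []
    for q, (p_out, p_lse), sk, sv in zip(queries, prefix_parts, suffix_ks, suffix_vs):
        # Suffix pass: each query attends only to its own sequence's suffix KV.
        s_out, s_lse = attention_with_lse(q, sk, sv)
        results.append(merge(p_out, p_lse, s_out, s_lse))
    return results

def baseline_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs):
    # Reference: each sequence attends over its full (prefix + suffix) KV cache.
    return [attention_with_lse(q, prefix_k + sk, prefix_v + sv)[0]
            for q, sk, sv in zip(queries, suffix_ks, suffix_vs)]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes, chosen arbitrarily for the demonstration.
rng = random.Random(0)
dim, prefix_len, suffix_len, batch = 4, 6, 3, 5
prefix_k = random_matrix(rng, prefix_len, dim)
prefix_v = random_matrix(rng, prefix_len, dim)
queries = random_matrix(rng, batch, dim)
suffix_ks = [random_matrix(rng, suffix_len, dim) for _ in range(batch)]
suffix_vs = [random_matrix(rng, suffix_len, dim) for _ in range(batch)]

fast = decomposed_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs)
slow = baseline_attention(queries, prefix_k, prefix_v, suffix_ks, suffix_vs)
max_err = max(abs(a - b) for f, s in zip(fast, slow) for a, b in zip(f, s))
print(f"max abs difference vs. full attention: {max_err:.2e}")
```

The two results should agree up to floating-point rounding. The sketch only demonstrates exactness; the throughput benefit reported in the abstract comes from reading the prefix KV once for the whole batch and from running the prefix pass as a hardware-friendly matrix multiplication, neither of which a pure-Python loop can show.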

Paper Structure (30 sections, 7 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Left: An example LLM inference scenario where a chatbot model processes many sequences that share a large prefix (the system prompt). Middle: An overview of Hydragen, where overall attention is decomposed into attention over the shared prefix (batched across all queries in a batch) and attention over the remaining suffixes (independent across sequences, as is normally done). Top Right: Hydragen's attention decomposition allows many matrix-vector products to be replaced with fewer matrix-matrix products. Bottom Right: Using matrix-matrix products is particularly important as GPUs dedicate an increasingly large ratio of their total FLOPS to tensor cores that are specialized in matrix multiplication.
  • Figure 2: An example of a batch of sequences with a hierarchical sharing pattern. This diagram depicts the setting of Section \ref{sec:exps-hierarchy}, which solves competitive programming problems using a few-shot prompt and by sampling many candidate solutions per problem. The few-shot prompt (orange) is globally shared across all sequences in the batch. However, the descriptions of each problem (green and blue) are only shared across the candidate solutions corresponding to that problem.
  • Figure 3: Left: End-to-end decoding throughput in tokens per second (TPS) with CodeLlama-13b when generating multiple completions from a prompt containing 2048 tokens. An "x" indicates that FlashAttention does not have enough memory to run. As the batch size grows, Hydragen achieves a significantly higher throughput than vLLM baselines. Throughput with Hydragen always remains within 50% of the upper bound where attention is entirely removed from the model. Details are in Section \ref{sec:e2e}. Right: Comparing decoding throughput of CodeLlama-13b between Hydragen, vLLM (with and without tokenization), and "No Attention", where the attention operation is removed from the model to demonstrate the throughput ceiling. In this scenario where the batch size is fixed, Hydragen improves throughput by up to 32x over the best baseline, with speedups increasing with prefix length.
  • Figure 4: Measuring the speedup of Hydragen attention over FlashAttention across various batch sizes, shared prefix lengths, and suffix lengths on a single A100-40GB GPU. Hydragen results in faster inference in all cases, particularly when the ratio of shared length to unique length is high and the batch size is large. We observe even larger performance gains when running on an L40S (a GPU with a higher compute-to-bandwidth ratio than an A100), shown in Figure \ref{fig:micro_moregpu}.
  • Figure 5: Time to answer questions about a 19,947-token document when benchmarking Yi-6B-200k on four A100-40GB GPUs. An "x" indicates that FlashAttention does not have enough memory to run. The time to process the prefix is excluded.
  • ...and 2 more figures