Glinthawk: A Two-Tiered Architecture for Offline LLM Inference

Pouya Hamadanian, Sadjad Fouladi

TL;DR

Glinthawk targets the throughput and cost of offline, batch-oriented LLM inference by decoupling attention and its KV cache from the rest of the model computation. It proposes a two-tier architecture in which high-end Tier 1 accelerators handle non-attention operations while Tier 2, composed of cheaper compute nodes, handles attention and holds the KV cache, enabling far larger effective batch sizes and independently scalable KV memory. In end-to-end experiments and simulations on Llama2-family models, Glinthawk improves throughput by $5.9\times$ and reduces cost by $2.8\times$ over paged-attention baselines, with larger gains at long sequence lengths (up to $16.3\times$ throughput at $2.4\times$ lower cost). The approach tolerates moderate inter-tier latency and needs little inter-tier bandwidth, making it practical for latency-tolerant batch-processing workloads, and it leaves room to combine with other optimizations and hardware setups for broader scalability.

Abstract

We introduce Glinthawk, an architecture for offline Large Language Model (LLM) inference. By leveraging a two-tiered structure, Glinthawk optimizes the utilization of the high-end accelerators ("Tier 1") by offloading the attention mechanism to a lower-end compute tier ("Tier 2"). This separation allows the memory demand of attention, known as the key-value cache, to scale independently from the model weights, enabling larger batch sizes and more efficient accelerator usage. Prototyped with NVIDIA T4 GPUs and standard CPU VMs, Glinthawk improves throughput by $5.9\times$ and reduces the cost of generation by $2.8\times$ compared to paged-attention baselines. For long sequence lengths, it achieves a $16.3\times$ throughput improvement at $2.4\times$ less cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it highly effective for latency-tolerant, throughput-focused applications such as batch processing. The prototype is publicly available at https://github.com/microsoft/glinthawk.
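
To make the abstract's point about key-value cache memory concrete, the following is a minimal back-of-the-envelope sketch of how the cache grows with batch size. It is not taken from the Glinthawk prototype: the function names are hypothetical, the model shape is the published Llama2-70B configuration (80 layers, 8 key-value heads, head dimension 128), and the 16-bit value size, the 2048-token sequence length, and the batch sizes are assumptions chosen for illustration. The paper's own accounting may differ.

```python
# Back-of-the-envelope KV-cache sizing (illustrative sketch, not Glinthawk code).
# Assumptions: 16-bit cache values, 2048-token sequences, example batch sizes.

GIB = 1024 ** 3


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading factor 2 accounts for storing both a key and a value vector.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def kv_cache_bytes(batch_size, seq_len, per_token_bytes):
    """Total KV-cache bytes when every sequence in the batch holds seq_len tokens."""
    return batch_size * seq_len * per_token_bytes


# Published Llama2-70B shape (grouped-query attention with 8 KV heads).
LLAMA2_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

per_token = kv_bytes_per_token(**LLAMA2_70B)

if __name__ == "__main__":
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    for batch in (1, 16, 256):
        total = kv_cache_bytes(batch, 2048, per_token)
        print(f"batch={batch:>3}: {total / GIB:.3f} GiB")
```

Under these assumptions the cache grows linearly with batch size, while the weights stay fixed. This is why holding the cache on a separate tier lets batch size scale without adding high-end accelerators.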

Paper Structure (60 sections, 6 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Glinthawk moves attention memory and compute to a second cluster of low-end nodes. This improves the utilization of the costly GPUs, which keep working on compute-heavy operations, raising total system throughput and reducing inference costs.
  • Figure 2: Throughput gain vs. batch size, for the Llama2-70B transformer running on an NVIDIA T4 GPU.
  • Figure 3: Glinthawk's batching schedule. Glinthawk hides the inter-tier transit time by keeping multiple batches in flight.
  • Figure 4: Inference throughput vs. setup cost for various inference engines with NVIDIA T4 GPUs as Tier 1 and AMD EPYC 7V12 16-core CPUs as Tier 2.
  • Figure 5: (a) Inference throughput for various schemes using 16 NVIDIA T4 GPUs. Glinthawk extracts more throughput from high-end Tier 1 machines compared to baselines. (b) Time per token across different schemes. Glinthawk gains throughput at the cost of higher time per token, due to large batch sizes and using a second tier. (c) Breakdown of time per token. With more Tier 2 nodes, Glinthawk runs at higher batch sizes and more in-flight batches, increasing compute time and queuing in GPUs.
  • ...and 10 more figures