Glinthawk: A Two-Tiered Architecture for Offline LLM Inference
Pouya Hamadanian, Sadjad Fouladi
TL;DR
Glinthawk addresses the throughput and cost challenges of offline, batch-oriented LLM inference by decoupling the attention KV-cache from the rest of the model computation. It proposes a two-tier architecture in which Tier 1, built from high-end accelerators, handles the non-attention operations, while Tier 2, composed of cheaper compute nodes, handles attention and stores the KV-cache. This split enables far larger effective batch sizes and lets KV-cache memory scale independently. In end-to-end experiments and simulations on Llama2-family models, Glinthawk improves throughput by $\sim 5.9\times$ and reduces cost by $\sim 2.8\times$ over paged-attention baselines, with even larger gains for long sequences (up to $16.3\times$ higher throughput at $2.4\times$ lower cost). The approach tolerates moderate inter-tier latency and needs little inter-tier bandwidth, which makes it practical for latency-tolerant, batch-processing workloads, and it can be combined with other optimizations and hardware setups for broader scalability.
Abstract
We introduce Glinthawk, an architecture for offline Large Language Model (LLM) inference. By leveraging a two-tiered structure, Glinthawk optimizes the utilization of the high-end accelerators ("Tier 1") by offloading the attention mechanism to a lower-end compute tier ("Tier 2"). This separation allows the memory demand of attention, known as the key-value cache, to scale independently of the model weights, enabling larger batch sizes and more efficient accelerator usage. Prototyped with NVIDIA T4 GPUs and standard CPU VMs, Glinthawk improves throughput by $5.9\times$ and reduces the cost of generation by $2.8\times$ compared to paged-attention baselines. For long sequence lengths, it achieves a $16.3\times$ throughput improvement at $2.4\times$ lower cost. Our evaluation shows that this architecture can tolerate moderate network latency with minimal performance degradation, making it highly effective for latency-tolerant, throughput-focused applications such as batch processing. The prototype is publicly available at https://github.com/microsoft/glinthawk.
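To make the memory argument concrete, the following is a rough back-of-the-envelope sketch, not taken from the Glinthawk prototype. It estimates how many sequences fit in memory when the key-value cache shares an accelerator with the model weights, compared with when it lives on a separate node. The model shape is that of Llama2-7B in fp16 (32 layers, 32 key-value heads, head dimension 128). The 16 GiB figure matches the memory of a T4, while the 13 GiB reserved for weights and activations and the 64 GiB Tier-2 node are illustrative assumptions, as are all names in the code.

```python
from dataclasses import dataclass

GiB = 1024 ** 3


@dataclass(frozen=True)
class ModelShape:
    """Transformer dimensions that determine the KV-cache size."""
    n_layers: int
    n_kv_heads: int
    head_dim: int
    bytes_per_value: int = 2  # fp16


def kv_bytes_per_token(m: ModelShape) -> int:
    # Each token stores one key and one value vector per layer per KV head.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * m.bytes_per_value


def max_batch(memory_bytes: int, reserved_bytes: int, seq_len: int, m: ModelShape) -> int:
    """Largest number of sequences whose KV-cache fits in the unreserved memory."""
    free = max(memory_bytes - reserved_bytes, 0)
    return free // (kv_bytes_per_token(m) * seq_len)


# Llama2-7B shape; the memory budgets below are assumptions for illustration.
llama2_7b = ModelShape(n_layers=32, n_kv_heads=32, head_dim=128)

# KV-cache co-located with the weights on one 16 GiB accelerator,
# with roughly 13 GiB assumed to be taken by weights and activations.
colocated = max_batch(16 * GiB, 13 * GiB, seq_len=2048, m=llama2_7b)

# KV-cache held on a hypothetical 64 GiB Tier-2 node that stores no weights.
per_tier2_node = max_batch(64 * GiB, 0, seq_len=2048, m=llama2_7b)

print(kv_bytes_per_token(llama2_7b))  # 524288 bytes, i.e. 512 KiB per token
print(colocated, per_tier2_node)      # 3 64
```

Under these assumptions, a 2048-token sequence needs 1 GiB of cache, so the co-located setup holds only a handful of sequences, whereas the achievable batch size in the decoupled setup grows with the number of Tier-2 nodes rather than with accelerator memory.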
