LIMINAL: Exploring The Frontiers of LLM Decode Performance

Michael Davies; Neal Crago; Karthikeyan Sankaralingam; Christos Kozyrakis

TL;DR

LIMINAL tackles the problem of identifying fundamental hardware-bound limits on LLM decode performance by presenting a first-principles analytical model that links application-level transformer operations to hardware constraints. The approach decouples software behavior from hardware design to explore current, near-future, and hypothetical architectures, deriving closed-form latency and throughput expressions and validating them against real measurements with a mean absolute percent error of $7.6\%$. Key contributions include a unified modeling framework; explicit treatment of tensor and pipeline parallelism and of MoE effects; and a detailed study of how memory capacity, bandwidth, synchronization latency, and packaging affect UTPS/STPS and efficiency, culminating in practical guidance for hardware-software co-design. The results show that while bandwidth upgrades (e.g., to 3D-DRAM) can yield large UTPS gains, achieving $>10{,}000$ UTPS will require algorithmic innovations alongside architectural advances, underscoring the need for balanced, cross-layer optimization in future LLM serving systems.

Abstract

The rapid advancement of Large Language Models (LLMs) necessitates a deep understanding of their fundamental performance limits. This paper investigates the limits of LLM inference, focusing on hardware-imposed bottlenecks in auto-regressive decoding. We develop LIMINAL, an analytical performance model that abstracts application requirements and hardware capabilities to systematically explore performance and efficiency across a wide range of current, near-future, and hypothetical hardware. Compared against LLMs executing on existing hardware, LIMINAL achieves a mean absolute error of $7.6\%$. Our analysis spans from the current HBM3 memory technology used in AI accelerators such as GPUs and TPUs to systems based on HBM4 and advanced 3D-stacked DRAM technology. We identify five non-negotiable challenges for LLM inference hardware, establishing compute, memory capacity, bandwidth, and collective communication as primary barriers to performance. These findings suggest that achieving significant performance gains beyond 10,000 tokens-per-second will require not just hardware evolution but also fundamental algorithmic advances.
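To make the idea of a hardware-imposed decode limit concrete, the following is a minimal roofline-style sketch. It is not the LIMINAL model itself: it assumes that each decode step must stream all weights plus every user's KV cache from memory once, that the step time is the larger of the memory time and the compute time, and that UTPS and STPS denote per-user and system-wide tokens per second. All names and numbers (`Hardware`, `Workload`, `decode_step_bound`, the example values) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hardware:
    mem_bandwidth: float  # bytes per second (hypothetical value supplied by caller)
    peak_flops: float     # floating-point operations per second


@dataclass
class Workload:
    weight_bytes: float       # bytes of model weights read per decode step
    kv_bytes_per_user: float  # bytes of KV cache read per user per step
    flops_per_token: float    # FLOPs needed to produce one token for one user


def decode_step_bound(hw: Hardware, wl: Workload, batch: int):
    """Return (step latency in s, per-user tokens/s, system tokens/s).

    Assumption: a step is limited by whichever is slower, moving bytes
    from memory or performing the arithmetic. Synchronization latency
    and parallelism effects are deliberately ignored in this sketch.
    """
    # Weights are read once per step and shared by the batch;
    # each user's KV cache is read separately.
    mem_time = (wl.weight_bytes + batch * wl.kv_bytes_per_user) / hw.mem_bandwidth
    compute_time = batch * wl.flops_per_token / hw.peak_flops
    latency = max(mem_time, compute_time)
    per_user_tps = 1.0 / latency   # one token per user per step
    system_tps = batch / latency   # all users' tokens per step
    return latency, per_user_tps, system_tps


if __name__ == "__main__":
    # Hypothetical device and model, not taken from the paper.
    hw = Hardware(mem_bandwidth=4e12, peak_flops=1e15)
    wl = Workload(weight_bytes=100e9, kv_bytes_per_user=0.0, flops_per_token=2e11)
    for batch in (1, 8):
        latency, utps, stps = decode_step_bound(hw, wl, batch)
        print(f"batch={batch} latency={latency:.4f}s UTPS={utps:.1f} STPS={stps:.1f}")
```

Under these assumptions the per-user rate is capped by how fast the weights can be streamed, which is why a bound of this kind responds strongly to memory bandwidth and only weakly to added compute at small batch sizes.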
