LIMINAL: Exploring The Frontiers of LLM Decode Performance

Michael Davies, Neal Crago, Karthikeyan Sankaralingam, Christos Kozyrakis

TL;DR

LIMINAL identifies the fundamental hardware-imposed limits on LLM decode performance using a first-principles analytical model that links application-level transformer operations to hardware constraints. The approach decouples software behavior from hardware design, so that current, near-future, and hypothetical architectures can be explored with the same model. It derives closed-form latency and throughput expressions and validates them against real measurements, with a mean absolute percent error of $7.6\%$. Key contributions include a unified modeling framework; explicit treatment of tensor parallelism, pipeline parallelism, and MoE effects; and a detailed study of how memory capacity, bandwidth, synchronization latency, and packaging affect UTPS (per-user tokens per second), STPS (system tokens per second), and efficiency, culminating in practical guidance for hardware-software co-design. The results show that while bandwidth upgrades (e.g., to 3D-DRAM) can yield large UTPS gains, achieving $>10{,}000$ UTPS will require algorithmic innovations alongside architectural advances, underscoring the need for balanced, cross-layer optimization in future LLM serving systems.

Abstract

The rapid advancement of Large Language Models (LLMs) necessitates a deep understanding of their fundamental performance limits. This paper investigates the limits of LLM inference, focusing on hardware-imposed bottlenecks in auto-regressive decoding. We develop LIMINAL, an analytical performance model that abstracts application requirements and hardware capabilities to systematically explore performance and efficiency across a wide range of current, near-future, and hypothetical hardware. We find that LIMINAL is accurate when compared against LLMs executing on existing hardware, achieving a mean absolute error of $7.6\%$. Our analysis spans from current HBM3 memory technology used in AI accelerators like GPUs and TPUs to systems based on advanced HBM4 and advanced 3D-stacked DRAM technology. We identify five non-negotiable challenges for LLM inference hardware, establishing compute, memory capacity, bandwidth, and collective communication as primary barriers to performance. These findings suggest that achieving significant performance gains beyond 10,000 tokens-per-second will require not just hardware evolution but also fundamental algorithmic advances.
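To make the kind of reasoning in the abstract concrete, the sketch below gives a roofline-style estimate of per-token decode time. It is not LIMINAL's closed-form model, which is not reproduced here. It assumes a decode step is bounded by the slower of memory traffic and compute, plus a fixed latency per collective. All names (`Workload`, `Hardware`, `decode_step_time`) and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass
class Workload:
    weight_bytes: float       # bytes of model weights read per decode step
    kv_bytes_per_user: float  # bytes of KV cache read per user per step
    flops_per_user: float     # FLOPs needed to generate one token for one user
    n_layers: int             # number of transformer layers
    syncs_per_layer: int      # collectives per layer (assumed, for tensor parallelism)


@dataclass
class Hardware:
    n_chips: int              # chips sharing the model
    bw_per_chip: float        # memory bandwidth per chip, bytes/s
    flops_per_chip: float     # peak compute per chip, FLOP/s
    sync_latency: float       # seconds per collective operation


def decode_step_time(w: Workload, hw: Hardware, batch: int = 1) -> float:
    """Estimated seconds per decode step (one token for each user in the batch)."""
    total_bw = hw.n_chips * hw.bw_per_chip
    total_flops = hw.n_chips * hw.flops_per_chip
    # Weights are read once per step; KV cache traffic scales with batch size.
    t_mem = (w.weight_bytes + batch * w.kv_bytes_per_user) / total_bw
    t_compute = batch * w.flops_per_user / total_flops
    # A single chip needs no collectives.
    t_sync = w.n_layers * w.syncs_per_layer * hw.sync_latency if hw.n_chips > 1 else 0.0
    return max(t_mem, t_compute) + t_sync


def utps(w: Workload, hw: Hardware, batch: int = 1) -> float:
    """Tokens per second seen by a single user."""
    return 1.0 / decode_step_time(w, hw, batch)


def stps(w: Workload, hw: Hardware, batch: int = 1) -> float:
    """Tokens per second across all users in the batch."""
    return batch * utps(w, hw, batch)


# Hypothetical example: 70e9 bytes of weights, 8 chips at 4 TB/s each.
example_workload = Workload(weight_bytes=70e9, kv_bytes_per_user=0.0,
                            flops_per_user=140e9, n_layers=80, syncs_per_layer=2)
example_hw = Hardware(n_chips=8, bw_per_chip=4e12,
                      flops_per_chip=1e15, sync_latency=2e-6)

if __name__ == "__main__":
    for scale in (1, 2, 4):
        hw = replace(example_hw, bw_per_chip=scale * example_hw.bw_per_chip)
        print(f"{scale}x bandwidth: UTPS ~ {utps(example_workload, hw):.0f}")
```

In this toy setting the synchronization term does not shrink as bandwidth grows, so each doubling of bandwidth yields less than a doubling of UTPS. That is one simple way to see why bandwidth upgrades alone may not be enough to reach very high per-user token rates.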

Paper Structure

This paper contains 20 sections, 8 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Transformer layer of Llama3 showing a Tensor-Parallel-8 mapping. Red lines indicate operation split for 8-way parallelism. LIMINAL modeling uses properties of LLMs such as compute FLOPS, memory transfer (bytes), and communication latencies.
  • Figure 2: UTPS sensitivity to bandwidth. Solid blue vertical line denotes 30 TB/s.
  • Figure 3: UTPS and STPS/Watt across different hardware technologies for T=128K. Dashed line indicates max TPW achieved.
  • Figure 4: Scatter plot of measured vs. modeled TPOT (time per output token), and histogram of the error.
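
The comparison summarized in Figure 4 reduces to an error metric over pairs of measured and modeled values. A minimal sketch of a mean absolute percent error computation follows. It assumes the error is normalized by the measured value, which the text above does not state, and the function name and sample numbers are made up for illustration rather than taken from the paper.

```python
def mean_absolute_percent_error(measured, modeled):
    """Return the mean of |measured - modeled| / |measured|, as a percentage."""
    if len(measured) != len(modeled) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for m, p in zip(measured, modeled):
        if m == 0:
            raise ValueError("measured values must be non-zero")
        total += abs(m - p) / abs(m)
    return 100.0 * total / len(measured)


if __name__ == "__main__":
    # Made-up TPOT values in milliseconds, for illustration only.
    measured_tpot = [10.0, 20.0, 40.0]
    modeled_tpot = [11.0, 18.0, 42.0]
    print(f"MAPE: {mean_absolute_percent_error(measured_tpot, modeled_tpot):.2f}%")
```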