A Systematic Characterization of LLM Inference on GPUs

Haonan Wang, Xuxin Xiao, Mingyu Yan, Zhuoyuan Zhu, Dengke Han, Duo Wang, Wenming Li, Xiaochun Ye, Cunchen Hu, Hongyang Chen, Guangyu Sun

TL;DR

This work presents a rigorous, multi-layered characterization of LLM inference on GPUs, introducing a four-dimensional framework that traces performance from macroscopic observations (two-phase heterogeneity) to microarchitectural root causes, system-scale scaling, and emerging paradigms like MoE and RAG. By integrating Roofline analysis, stall/memory-pattern profiling, and cross-platform experiments (data-center GPUs and edge devices), the authors reveal that Prefill is compute-bound while Decode is memory-bound, with bottlenecks migrating based on input length, context size, and workload. The study demonstrates phase-aware scaling principles, showing that Tensor Parallelism best serves Prefill while Pipeline or single-GPU execution benefits Decode, and it highlights that energy consumption is predictable and driven predominantly by the Decode phase. Furthermore, it analyzes MoE and RAG as paradigms that redefine bottlenecks, providing actionable optimization guidelines for architecture-system co-design from cloud to edge environments.
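The compute-bound versus memory-bound split can be illustrated with a back-of-the-envelope Roofline calculation. The sketch below is not the authors' methodology: it models a single FP16 weight matrix multiply, and the function names, the layer size (4096 x 4096), the token counts, and the hardware peaks (300 TFLOP/s, 2 TB/s) are all hypothetical values chosen for illustration. It compares the arithmetic intensity of a Prefill-like call (many tokens per weight read) with a Decode-like call (one token per weight read) against the ridge point of the assumed GPU.

```python
def gemm_arithmetic_intensity(n_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for an (n_tokens x d_in) @ (d_in x d_out) matmul.

    Idealized model: weights, inputs, and outputs each cross the memory
    bus exactly once (no cache reuse, no KV cache traffic).
    """
    flops = 2 * n_tokens * d_in * d_out  # one multiply + one add per MAC
    elems_moved = d_in * d_out + n_tokens * d_in + n_tokens * d_out
    return flops / (bytes_per_elem * elems_moved)


def roofline_bound(intensity, peak_flops, mem_bandwidth):
    """Classify a kernel relative to the ridge point peak_flops / mem_bandwidth."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Hypothetical hardware peaks (assumptions, not taken from the paper).
PEAK_FLOPS = 300e12  # 300 TFLOP/s
MEM_BW = 2e12        # 2 TB/s

# Prefill-like: 2048 prompt tokens share one read of the weights.
prefill_ai = gemm_arithmetic_intensity(2048, 4096, 4096)
# Decode-like: a single new token per weight read.
decode_ai = gemm_arithmetic_intensity(1, 4096, 4096)

prefill_class = roofline_bound(prefill_ai, PEAK_FLOPS, MEM_BW)
decode_class = roofline_bound(decode_ai, PEAK_FLOPS, MEM_BW)

print(f"prefill: {prefill_ai:.1f} FLOP/byte, {prefill_class}")
print(f"decode:  {decode_ai:.3f} FLOP/byte, {decode_class}")
```

Under these assumptions the Prefill-like call lands far above the ridge point of 150 FLOP/byte and the Decode-like call far below it. Real workloads add attention, KV cache traffic, and batching, which is why the summary describes bottlenecks that migrate with input length, context size, and workload rather than a fixed classification.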

Abstract

This work presents a systematic characterization of Large Language Model (LLM) inference to address fragmented understanding. Through comprehensive experiments, we establish a four-dimensional analytical framework: (1) Two-Phase Heterogeneity Observation; (2) Microarchitectural Root Cause Analysis; (3) System Scaling Principles; and (4) Emerging Paradigm Boundaries. Our investigation progresses systematically from observation to foresight: identifying performance phenomena, revealing hardware causes, validating system behavior, and exploring new paradigms. This study not only consolidates a reliable empirical foundation for existing research but also provides new discoveries and practical optimization guidance for LLM inference.

Paper Structure

This paper contains 40 sections, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Illustration of the LLM inference process, comprising the Prefill and Decode phases.
  • Figure 2: Overview of Emerging Paradigms: (a) MoE architecture; (b) RAG workflow.
  • Figure 3: (a) Overview of SM utilization and DRAM throughput across Prefill and Decode in the Chat scenario. (b) Transition of latency dominance from Decode to Prefill phase under increasing input length and fixed output (128 tokens), measured on Qwen2.5-7B with a single GPU.
  • Figure 4: Operator-level bottleneck migration: transition between FFN-dominated and Attention-dominated latency as context length varies: (a) Llama-3-8B. (b) Qwen2.5-32B.
  • Figure 5: Characterizing the throughput versus latency trade-off under varying batch sizes for models (a) Qwen2.5-7B, (b) Qwen2.5-32B, and (c) Qwen3-30B-A3B.
  • ...and 13 more figures