A Systematic Characterization of LLM Inference on GPUs
Haonan Wang, Xuxin Xiao, Mingyu Yan, Zhuoyuan Zhu, Dengke Han, Duo Wang, Wenming Li, Xiaochun Ye, Cunchen Hu, Hongyang Chen, Guangyu Sun
TL;DR
This work presents a rigorous, multi-layered characterization of LLM inference on GPUs, introducing a four-dimensional framework that traces performance from macroscopic observations (two-phase heterogeneity) to microarchitectural root causes, system-level scaling behavior, and emerging paradigms such as MoE and RAG. By integrating Roofline analysis, stall and memory-access-pattern profiling, and cross-platform experiments (data-center GPUs and edge devices), the authors show that Prefill is compute-bound while Decode is memory-bound, with the bottleneck shifting according to input length, context size, and workload. The study derives phase-aware scaling principles, showing that Tensor Parallelism best serves Prefill while Pipeline Parallelism or single-GPU execution benefits Decode, and it finds that energy consumption is predictable and driven predominantly by the Decode phase. Finally, it analyzes MoE and RAG as paradigms that redefine where the bottlenecks lie, and provides actionable optimization guidelines for architecture-system co-design from cloud to edge environments.
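The compute-bound versus memory-bound distinction above can be illustrated with a first-order Roofline estimate for a single linear layer. The sketch below is not the authors' methodology: the hardware figures (`PEAK_FLOPS`, `MEM_BANDWIDTH`), the layer width, and the function names are hypothetical values chosen for illustration, and the model ignores attention, the KV cache, and caching effects, all of which can shift the bottleneck.

```python
# First-order Roofline sketch for one FP16 linear layer y = x @ W.
# All hardware numbers below are hypothetical, not measurements from the paper.

PEAK_FLOPS = 300e12      # assumed peak throughput: 300 TFLOP/s
MEM_BANDWIDTH = 2e12     # assumed memory bandwidth: 2 TB/s


def linear_layer_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """Arithmetic intensity (FLOP per byte) of x[tokens, d_in] @ W[d_in, d_out]."""
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per MAC
    # Read x and W once, write y once (idealized, no reuse losses).
    bytes_moved = bytes_per_elem * (tokens * d_in + d_in * d_out + tokens * d_out)
    return flops / bytes_moved


def classify(intensity, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BANDWIDTH):
    """Compare intensity with the Roofline ridge point (peak FLOP/s / bandwidth)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


if __name__ == "__main__":
    d_model = 4096  # assumed hidden size
    # Prefill processes the whole prompt at once; Decode processes one token per step.
    for phase, tokens in (("Prefill", 2048), ("Decode", 1)):
        ai = linear_layer_intensity(tokens, d_model, d_model)
        print(f"{phase}: {ai:.2f} FLOP/byte -> {classify(ai)}")
```

Under these assumptions the ridge point is 150 FLOP/byte. A 2048-token Prefill pass reaches roughly 1024 FLOP/byte and lands on the compute-bound side, while a single-token Decode step stays near 1 FLOP/byte because the full weight matrix is read to produce one output row.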
Abstract
This work presents a systematic characterization of Large Language Model (LLM) inference to address the fragmented understanding of its behavior. Through comprehensive experiments, we establish a four-dimensional analytical framework: (1) Two-Phase Heterogeneity Observation; (2) Microarchitectural Root Cause Analysis; (3) System Scaling Principles; and (4) Emerging Paradigm Boundaries. Our investigation progresses systematically from observation to foresight: identifying performance phenomena, revealing their hardware causes, validating system-level behavior, and exploring new paradigms. This study not only consolidates a reliable empirical foundation for existing research but also provides new findings and practical optimization guidance for LLM inference.
