Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures

Prabhu Vellaisamy; Thomas Labonte; Sourav Chakraborty; Matt Turner; Samantika Sury; John Paul Shen

TL;DR

The paper addresses latency-sensitive LLM inference across loosely- and closely-coupled CPU-GPU platforms by introducing SKIP, a fine-grained profiler, and TKLQT (Total Kernel Launch and Queuing Time), a metric that distinguishes CPU-bound from GPU-bound regimes. It analyzes operator-to-kernel offloads on both platform types, shows that GH200 improves prefill latency at large batch sizes but remains CPU-bound at smaller ones, and proposes a proximity-score-based kernel fusion framework to mitigate launch overhead. Key contributions include a systematic metric suite (TKLQT, AKD, IL, GPU Idle Time), a workload-classification method, and a scalable fusion recommendation approach that yields idealized speedups of up to 2.7x–6.8x in CPU-bound regions. The work offers practical insights for optimizing CPU-GPU coupling strategies and lays the groundwork for broader evaluation across future CC/TC architectures and workloads.

Abstract

Large language model (LLM)-based inference workloads increasingly dominate data center costs and resource utilization. Therefore, understanding the inference workload characteristics on evolving CPU-GPU coupled architectures is crucial for optimization. This paper presents an in-depth analysis of LLM inference behavior on loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems. We analyze performance dynamics using fine-grained operator-to-kernel trace analysis, facilitated by our novel profiler SKIP and metrics like Total Kernel Launch and Queuing Time (TKLQT). Results show that closely-coupled (CC) GH200 significantly outperforms loosely-coupled (LC) systems at large batch sizes, achieving 1.9x-2.7x faster prefill latency for Llama 3.2-1B. However, our analysis also reveals that GH200 remains CPU-bound up to 4x larger batch sizes than LC systems. In this extended CPU-bound region, we identify the performance characteristics of the Grace CPU as a key factor contributing to higher inference latency at low batch sizes on GH200. We demonstrate that TKLQT accurately identifies this CPU/GPU-bound transition point. Based on this analysis, we further show that kernel fusion offers significant potential to mitigate GH200's low-batch latency bottleneck by reducing kernel launch overhead. This detailed kernel-level characterization provides critical insights for optimizing diverse CPU-GPU coupling strategies. This work is an initial effort, and we plan to explore other major AI/DL workloads that demand different degrees of CPU-GPU heterogeneous architectures.
