Characterizing and Optimizing LLM Inference Workloads on CPU-GPU Coupled Architectures

Prabhu Vellaisamy, Thomas Labonte, Sourav Chakraborty, Matt Turner, Samantika Sury, John Paul Shen

TL;DR

The paper studies latency-sensitive LLM inference across different degrees of CPU-GPU coupling. It introduces SKIP, a fine-grained profiler, and the TKLQT metric (Total Kernel Launch and Queuing Time) to distinguish CPU-bound from GPU-bound regimes. It analyzes operator–kernel offloads on loosely- and closely-coupled platforms and shows that the closely-coupled GH200 delivers lower prefill latency at large batch sizes but stays CPU-bound up to 4x larger batch sizes than loosely-coupled systems. To mitigate kernel launch overhead in that region, it proposes a proximity-score-based kernel fusion framework. Key contributions include a systematic metric suite (TKLQT, AKD, IL, GPU Idle Time), a workload-classification method, and a scalable fusion recommendation approach that yields idealized speedups of up to 2.7x–6.8x in CPU-bound regions. The work offers practical guidance for optimizing CPU-GPU coupling strategies and lays the groundwork for broader evaluation across future closely-coupled (CC) and tightly-coupled (TC) architectures and workloads.

Abstract

Large language model (LLM)-based inference workloads increasingly dominate data center costs and resource utilization. Therefore, understanding the inference workload characteristics on evolving CPU-GPU coupled architectures is crucial for optimization. This paper presents an in-depth analysis of LLM inference behavior on loosely-coupled (PCIe A100/H100) and closely-coupled (GH200) systems. We analyze performance dynamics using fine-grained operator-to-kernel trace analysis, facilitated by our novel profiler SKIP and metrics like Total Kernel Launch and Queuing Time (TKLQT). Results show that closely-coupled (CC) GH200 significantly outperforms loosely-coupled (LC) systems at large batch sizes, achieving 1.9x-2.7x faster prefill latency for Llama 3.2-1B. However, our analysis also reveals that GH200 remains CPU-bound up to 4x larger batch sizes than LC systems. In this extended CPU-bound region, we identify the performance characteristics of the Grace CPU as a key factor contributing to higher inference latency at low batch sizes on GH200. We demonstrate that TKLQT accurately identifies this CPU/GPU-bound transition point. Based on this analysis, we further show that kernel fusion offers significant potential to mitigate GH200's low-batch latency bottleneck by reducing kernel launch overhead. This detailed kernel-level characterization provides critical insights for optimizing diverse CPU-GPU coupling strategies. This work is an initial effort, and we plan to explore other major AI/DL workloads that demand different degrees of CPU-GPU heterogeneous architectures.

Paper Structure

This paper contains 24 sections, 8 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Evolution of the CPU-GPU coupling paradigm. On the left, traditional data center architectures employ PCIe interconnects between discrete CPUs and GPUs, each maintaining separate memory pools. These systems are loosely-coupled (LC). In the center, closely-coupled (CC) architectures combine CPUs and GPUs on the same board, enabling unified memory access despite physically separate memories, and employ high-speed interconnects for data movement. On the right, tightly-coupled (TC) architectures integrate the processing units (PUs) within the same package and share a unified physical memory.
  • Figure 2: Types of Kernel Fusion. The figure depicts a kernel sequence $k_1, k_2, k_3, \ldots, k_n$ in a GPU stream, triggered by CPU operators. From left to right: (1) Kernel-to-kernel offload (eager mode, unfused), (2) Domain-specific operator fusion (e.g., FlashAttention fusing self-attention operators), and (3) Entire graph offload (e.g., torch.compile/CUDA Graphs fusing larger subgraphs or the whole graph).
  • Figure 3: TTFT speedups of FlashAttention-2 and torch.compile (max-autotune mode) over eager-mode execution for various 7B decoder models. The evaluation platform is an Intel Xeon Platinum CPU connected to an NVIDIA H100 over PCIe Gen5.
  • Figure 4: Operator-kernel execution timing. Shows how CPU-side operators trigger GPU kernel execution, with the launch latency ($t_l$) defined as the time from the start of CPU launch call $l$ to the start of execution of kernel $k$.
  • Figure 5: Increasing computation in each GPU kernel allows the CPU to queue up many kernels while hiding the offload latency. However, too much GPU computation results in large CPU idle times and long latencies to batch completion, translating to user-visible response latency.
  • ...and 6 more figures
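
The launch latency $t_l$ from the Figure 4 caption can be made concrete with a small trace calculation. The sketch below is illustrative only. The `KernelEvent` record, its field names, and the sample timestamps are hypothetical and are not SKIP's trace format. Summing $t_l$ over all kernels is an assumed reading of the name "Total Kernel Launch and Queuing Time", not the paper's stated formula, and the idle-gap helper is likewise an assumed definition of GPU idle time for a single stream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelEvent:
    """One CPU launch call paired with the GPU kernel it triggers (hypothetical record)."""
    name: str
    launch_start_us: float  # start of the CPU-side launch call l
    kernel_start_us: float  # start of execution of kernel k on the GPU
    kernel_end_us: float    # end of execution of kernel k on the GPU


def launch_latency_us(ev: KernelEvent) -> float:
    """t_l as in the Figure 4 caption: launch-call start to kernel-execution start."""
    return ev.kernel_start_us - ev.launch_start_us


def total_launch_and_queue_us(events: list[KernelEvent]) -> float:
    """ASSUMPTION: a TKLQT-style total, taken here as the sum of t_l over all kernels."""
    return sum(launch_latency_us(ev) for ev in events)


def gpu_idle_us(events: list[KernelEvent]) -> float:
    """ASSUMPTION: GPU idle time as the gaps between consecutive kernels in one stream."""
    ordered = sorted(events, key=lambda e: e.kernel_start_us)
    idle = 0.0
    for prev, nxt in zip(ordered, ordered[1:]):
        idle += max(0.0, nxt.kernel_start_us - prev.kernel_end_us)
    return idle


# Made-up timestamps in microseconds, for illustration only.
trace = [
    KernelEvent("k1", launch_start_us=0.0, kernel_start_us=5.0, kernel_end_us=8.0),
    KernelEvent("k2", launch_start_us=6.0, kernel_start_us=11.0, kernel_end_us=14.0),
    KernelEvent("k3", launch_start_us=12.0, kernel_start_us=17.0, kernel_end_us=20.0),
]

if __name__ == "__main__":
    for ev in trace:
        print(f"{ev.name}: t_l = {launch_latency_us(ev)} us")
    print("total launch + queue time:", total_launch_and_queue_us(trace), "us")
    print("GPU idle time:", gpu_idle_us(trace), "us")
```

In this toy trace each kernel runs for 3 us but waits 5 us after its launch call, so the GPU sits idle between kernels. That is the CPU-bound pattern the abstract attributes to low batch sizes, where reducing the number of launches through kernel fusion has the most room to help.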