Characterizing CPU-Induced Slowdowns in Multi-GPU LLM Inference

Euijun Chung; Yuxiao Jia; Aaron Jezghani; Hyesoon Kim

Abstract

Large-scale machine learning workloads increasingly rely on multi-GPU systems, yet their performance is often limited by an overlooked component: the CPU. Through a detailed study of modern large language model (LLM) inference and serving workloads, we find that multi-GPU performance frequently degrades not because GPUs are saturated, but because CPUs fail to keep the GPUs busy. Under limited CPU allocations, systems exhibit symptoms such as delayed kernel launches, stalled communication, and increased tokenization latency, leading to severe GPU underutilization even when ample GPU resources are available. This work presents a systematic analysis of CPU-induced slowdowns in multi-GPU LLM inference. We show that these bottlenecks persist even in serving stacks that employ process-level separation and modern GPU-side optimizations such as CUDA Graphs. Because the marginal cost of additional CPU cores is small relative to GPU instance pricing, our evaluation indicates that increasing the number of CPU cores can substantially improve performance and stability at minimal additional cost. Under moderate serving load, CPU-starved configurations frequently time out, while providing adequate CPU resources restores responsiveness and reduces time-to-first-token (TTFT) latency by 1.36-5.40$\times$ across configurations, all without requiring additional GPUs. These results show that CPU provisioning is a crucial factor when configuring multi-GPU LLM inference and that adequate provisioning helps prevent control-side bottlenecks.
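As a rough illustration of what a "limited CPU allocation" looks like from inside a job, the following sketch computes the number of CPU cores available per GPU for the current process. It is not taken from the paper: the function names `cpu_cores_per_gpu` and `is_cpu_starved` are hypothetical, the GPU count is supplied by the caller, and the starvation threshold is a caller-chosen placeholder rather than a value the paper recommends.

```python
import os


def cpu_cores_per_gpu(num_gpus, available_cores=None):
    """Return the number of usable CPU cores per GPU (hypothetical helper).

    If available_cores is not given, it is read from the process's CPU
    affinity mask on Linux, which reflects scheduler or job-manager pinning.
    On platforms without sched_getaffinity, os.cpu_count() is used instead.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if available_cores is None:
        if hasattr(os, "sched_getaffinity"):
            available_cores = len(os.sched_getaffinity(0))
        else:
            available_cores = os.cpu_count() or 1
    return available_cores / num_gpus


def is_cpu_starved(num_gpus, min_cores_per_gpu, available_cores=None):
    """Flag an allocation whose cores-per-GPU ratio is below a threshold.

    min_cores_per_gpu is an assumption chosen by the caller; the source text
    does not state a specific threshold.
    """
    return cpu_cores_per_gpu(num_gpus, available_cores) < min_cores_per_gpu


if __name__ == "__main__":
    # Example: report the ratio for a hypothetical 4-GPU job.
    print(f"CPU cores per GPU: {cpu_cores_per_gpu(4):.2f}")
```

A check like this could run at job start-up to warn when an allocation gives each GPU host process very few cores, which is the situation the abstract associates with delayed kernel launches and timeouts.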

Paper Structure (22 sections, 13 figures, 1 table)

This paper contains 22 sections, 13 figures, 1 table.

Introduction
Background & Motivation
CPU's Job in LLM Inference
Real-World CPU Under-Provisioning in HPC Clusters
Multi-GPU System Evaluation Setup
CPU Bottleneck in LLM Inference
Tokenization Latency Evaluation in LLM Inference
Impact of Tokenization Load in LLM Serving
Understanding the CPU Bottlenecks in Multi-GPU Systems
Synchronization and CPU Oversubscription in Communication Kernels
Shared Memory Broadcast Contention
Discussion
CPU Under-Provisioning in Cloud Compute Platforms
Emerging Trends That May Intensify CPU Bottlenecks
Limitations
...and 7 more sections

Figures (13)

Figure 1: Overview of the CPU's jobs in multi-GPU LLM serving. Multithreaded input processing and per-GPU host processes prepare tokenized inputs, manage kernel launches, and handle synchronization.
Figure 2: Overview of the LLM inference pipeline, illustrating the CPU-intensive tokenization stage and GPU-intensive model computation stage.
Figure 3: CDF of users' CPU-to-GPU allocation ratios in the instructional cluster, weighted by GPU hours. Vertical lines mark percentiles; users manually set CPU and GPU counts.
Figure 4: CDF of users' CPU-to-GPU allocation ratios in the research cluster, weighted by GPU hours. Vertical lines mark percentiles; CPU cores are set proportionally to GPUs unless users override the setting.
Figure 5: Relative latency breakdown of tokenization and the time-to-first-token (TTFT) across varying batch sizes and sequence lengths (SL). CPU-side tokenization accounts for up to half of the total latency. Measured with Llama 3.1 8B on a 4$\times$H200 system.
...and 8 more figures
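Figures 3 and 4 describe CDFs of CPU-to-GPU allocation ratios weighted by GPU hours. The sketch below shows one way such a weighted percentile could be computed. It is an illustration under assumptions, not the authors' analysis code: the function name `weighted_percentile` is hypothetical, and the sample records are made-up values, not data from either cluster.

```python
def weighted_percentile(samples, q):
    """Return the smallest ratio whose cumulative weight share reaches q percent.

    samples: iterable of (cpu_to_gpu_ratio, gpu_hours) pairs.
    q: percentile in the range [0, 100].
    """
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    # Sort by ratio and drop records that contribute no GPU hours.
    ordered = sorted((ratio, hours) for ratio, hours in samples if hours > 0)
    total = sum(hours for _, hours in ordered)
    if total <= 0:
        raise ValueError("samples must contain positive GPU hours")
    threshold = total * q / 100.0
    cumulative = 0.0
    for ratio, hours in ordered:
        cumulative += hours
        if cumulative >= threshold:
            return ratio
    return ordered[-1][0]


if __name__ == "__main__":
    # Made-up job records: (CPU cores per GPU, GPU hours consumed).
    jobs = [(1.0, 10.0), (4.0, 30.0), (8.0, 60.0)]
    for q in (10, 40, 50):
        print(f"p{q}: {weighted_percentile(jobs, q)} cores per GPU")
```

Weighting by GPU hours means that long-running, GPU-heavy jobs dominate the distribution, so the percentiles describe how GPU time is provisioned rather than how many users chose a given ratio.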
