A CPU-Centric Perspective on Agentic AI

Ritik Raj, Hong Wang, Tushar Krishna

TL;DR

Agentic AI combines external tools with LLMs to enable planning and action, but CPU-bound tool processing often dominates end-to-end performance. This work provides a CPU-centric characterization of agentic workloads along three orthogonal axes, profiles five representative workloads across latency, throughput, and energy, and identifies key CPU bottlenecks. It then introduces CGAM and MAWS scheduling techniques to improve latency, throughput, and energy efficiency, achieving substantial P50 latency speedups and tail-latency improvements. The findings highlight the importance of CPU-aware orchestration in scaling autonomous AI systems on commodity hardware.

Abstract

Agentic AI frameworks add a decision-making orchestrator, embedded with external tools (web search, a Python interpreter, a contextual database, and others), on top of monolithic LLMs, turning them from passive text oracles into autonomous problem-solvers that can plan, call tools, remember past steps, and adapt on the fly. This paper aims to characterize and understand the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first systematically characterize agentic AI on the basis of the orchestrator/decision-making component, inference-path dynamics, and the repetitiveness of the agentic flow, all of which directly influence system-level performance. Based on this characterization, we choose five representative agentic AI workloads (Haystack RAG, Toolformer, ChemCrow, LangChain, and SWE-Agent) to profile latency, throughput, and energy metrics and demystify the significant impact of CPUs on these metrics relative to GPUs. We observe that: (1) tool processing on CPUs can take up to 90.6% of the total latency; (2) agentic throughput gets bottlenecked either by CPU factors (coherence, synchronization, and over-subscription of cores) or by GPU factors (main-memory capacity and bandwidth); (3) CPU dynamic energy consumes up to 44% of the total dynamic energy at large batch sizes. Based on the profiling insights, we present two key optimizations, (1) CPU and GPU-Aware Micro-batching (CGAM) and (2) Mixed Agentic Workload Scheduling (MAWS), for homogeneous and heterogeneous agentic workloads respectively, to demonstrate the potential to improve the performance, efficiency, and scalability of agentic AI. We achieve up to 2.1x and 1.41x P50 latency speedup compared to the multi-processing baseline for homogeneous and heterogeneous agentic workloads, respectively.
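The abstract describes CGAM only at a high level: micro-batches should be sized jointly against CPU and GPU limits, since over-subscribing CPU cores with tool work hurts latency just as much as exceeding the GPU's saturation batch. As a rough, hypothetical illustration of that sizing rule (not the paper's actual algorithm; the function names and parameters below are assumptions), one might cap the micro-batch size by both the number of requests the CPU can serve without thread over-subscription and the batch size at which GPU throughput saturates:

```python
def microbatch_size(batch_size: int, cpu_cores: int,
                    threads_per_request: int, gpu_saturation_batch: int) -> int:
    """Pick a micro-batch size bounded by both CPU and GPU capacity.

    - cpu_cores // threads_per_request: largest micro-batch whose tool
      stages can run without over-subscribing CPU cores.
    - gpu_saturation_batch: batch size beyond which GPU throughput
      no longer improves, so larger micro-batches only add queueing.
    """
    cpu_limit = max(1, cpu_cores // threads_per_request)
    return max(1, min(batch_size, cpu_limit, gpu_saturation_batch))


def split_into_microbatches(requests: list, mb: int) -> list:
    """Split a request batch into consecutive micro-batches of size mb."""
    return [requests[i:i + mb] for i in range(0, len(requests), mb)]


# Example: 128 requests, 32 cores, 2 tool threads per request,
# GPU saturating at batch 64 -> micro-batches of 16 requests each.
mb = microbatch_size(batch_size=128, cpu_cores=32,
                     threads_per_request=2, gpu_saturation_batch=64)
chunks = split_into_microbatches(list(range(128)), mb)
```

This is only a sketch of the sizing intuition; the paper's CGAM scheduler presumably also pipelines tool processing and inference across micro-batches, which is omitted here.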

Paper Structure

This paper contains 44 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Characterization of agentic AI workloads on the basis of (a) Orchestrator (LLM and Host) (b) Agentic Path (Static and Dynamic) and (c) Repetitiveness (Single-step and Multi-step)
  • Figure 2: (a) Haystack with ENNS retrieval on QA benchmarks (b) Toolformer with WolframAlpha API on math benchmarks (c) ChemCrow with literature (arXiv/PubMed) search tool on chemistry benchmarks (d) LangChain with web search and LexRank summarization tools on QA benchmarks (e) Mini-SWE-Agent with bash/Python execution tools on coding benchmarks
  • Figure 3: Comparison of multi-processing and multi-threading with the sequential baseline (single core) for the LangChain workload
  • Figure 4: (a) vLLM throughput saturation for the GPT-OSS-20B model (b) Throughput saturation for various agentic workloads (c) Average time taken by different components in the LangChain benchmark, showing a critical CPU context-switching bottleneck at batch size 128
  • Figure 5: CPU (AMD Threadripper) and GPU (Nvidia B200) dynamic energy consumption for the LangChain workload
  • ...and 4 more figures