Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Haoning Guan, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo

TL;DR

This paper tackles the challenge of running concurrent, flow-aware agentic LLM workloads on commodity hetero-SoCs by introducing Agent.xpu, an engine that orchestrates reactive and proactive flows using a heterogeneous execution graph (HEG) and flow-aware coordination with stage elasticity. The system decouples prefill and decode across NPU and iGPU to mitigate DDR bandwidth contention, while offering fine-grained preemption and slack-aware piggybacking to guarantee responsive user interactions without starving background tasks. Empirical results on Intel Core Ultra platforms show 1.2-4.9$\times$ proactive throughput gains, reactive latency reductions of at least 91%, and substantial energy savings and iGPU utilization reductions compared to baselines. These findings demonstrate a practical path to efficient, private, on-device personal agents that leverage heterogeneous accelerators through flow-aware scheduling and dynamic kernel binding.

Abstract

Personal LLM agents increasingly combine foreground reactive interactions with background proactive monitoring, forming long-lived, stateful LLM flows that interleave prefill and token-by-token decode. While modern heterogeneous SoCs integrate CPUs, iGPUs, and NPUs to support on-device intelligence, existing LLM engines assume static, single-shot inference and lack mechanisms for flow-level concurrency, prioritization, and efficient accelerator coordination. As a result, commodity SoCs remain poorly matched to the dynamic, mixed-criticality execution patterns of personal agents. This paper presents Agent.xpu, the first LLM engine that orchestrates concurrent reactive and proactive LLM flows on commodity SoCs. Extensive profiling uncovers unique SoC characteristics of operator-accelerator affinity, asymmetric DDR contention, and stage-divergent batching behaviors distinct from cloud-serving assumptions. Agent.xpu introduces three key techniques: a heterogeneous execution graph (HEG) capturing NPU/iGPU affinity and elastic operator binding; flow-aware NPU-iGPU coordination with stage elasticity, decoupling prefill and decode to reduce bandwidth contention and enforce priorities; and fine-grained preemption with slack-aware piggybacking to guarantee reactive responsiveness without starving proactive work. Across realistic personal-agent workloads, Agent.xpu delivers 1.2-4.9$\times$ proactive throughput and reduces reactive latency by at least 91%, compared with both an industrial iGPU-only serving engine and NPU-iGPU static inference with optimal tensor-partitioning schemes. Agent.xpu also minimizes energy consumption and graphics interference via controlled iGPU usage.
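
To make the scheduling vocabulary in the abstract concrete, the toy simulation below shows one plausible reading of kernel-boundary preemption combined with slack-based backfilling on a single accelerator. It is not the paper's algorithm: the `Flow` class, the `simulate` function, the per-kernel durations, and the deadline model are all hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Flow:
    """A hypothetical flow: a queue of kernel durations in milliseconds."""
    name: str
    kernels: deque                      # remaining per-kernel durations (ms)
    arrival_ms: float = 0.0             # when the flow becomes runnable
    deadline_ms: float = float("inf")   # only meaningful for the reactive flow


def simulate(reactive: Flow, proactive: Flow):
    """Return a trace of (start_ms, flow_name) kernel launches.

    Assumed policy (illustrative only):
    - A running kernel is never interrupted; the scheduler decides again at
      every kernel boundary, so a newly arrived reactive flow waits at most
      one proactive kernel.
    - While the reactive flow is runnable, a proactive kernel is launched
      only if it fits inside the reactive flow's remaining slack.
    """
    now, trace = 0.0, []
    while reactive.kernels or proactive.kernels:
        reactive_ready = bool(reactive.kernels) and reactive.arrival_ms <= now
        if reactive_ready:
            # Slack = time left before the deadline after all reactive work.
            slack = reactive.deadline_ms - now - sum(reactive.kernels)
            fits = bool(proactive.kernels) and proactive.kernels[0] <= slack
            flow = proactive if fits else reactive
        elif proactive.kernels:
            flow = proactive
        else:
            # Nothing runnable yet: idle until the reactive flow arrives.
            now = reactive.arrival_ms
            continue
        trace.append((now, flow.name))
        now += flow.kernels.popleft()
    return trace


if __name__ == "__main__":
    for deadline in (16.0, 20.0):
        r = Flow("reactive", deque([3.0, 3.0]), arrival_ms=5.0, deadline_ms=deadline)
        p = Flow("proactive", deque([4.0, 4.0, 4.0]))
        print(deadline, simulate(r, p))
```

With the 16 ms deadline, the reactive flow takes over at the first kernel boundary after its arrival (t = 8 ms) and finishes at 14 ms, after which the last proactive kernel runs. Loosening the deadline to 20 ms leaves enough slack for one more proactive kernel to run first, and the reactive flow still finishes by 18 ms.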

Paper Structure

This paper contains 23 sections, 11 equations, 13 figures, 2 tables, 2 algorithms.

Figures (13)

  • Figure 1: Personal LLM Agent System. Agent.xpu bridges the agent applications and heterogeneous SoC, orchestrating stateful on-device LLM flows from both foreground reactive agents and background proactive agents.
  • Figure 2: Shared-Memory Hetero-SoC. The iGPU builds upon thread-level execution units (EUs), while the NPU adopts a multiply-accumulate (MAC) array for efficient tensor operations.
  • Figure 3: Schematic Roofline Illustration of LLM Ops.
  • Figure 4: Memory Contention Analysis. Changes in execution time (upper) and DDR bandwidth (lower) from standalone NPU/iGPU kernel execution to simultaneous co-execution. Memory-bound GEMV kernels are more sensitive to NPU/iGPU parallelism than compute-bound GEMM.
  • Figure 5: Individual Task Latency in Batching. Distinctive batching effects of prefill and decode on the Llama-3B model.
  • ...and 8 more figures
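
The roofline view behind Figures 3 and 4 can be sketched with a short calculation of arithmetic intensity (FLOPs per byte moved) for a matrix multiply of shape (m, k) by (k, n). The hardware numbers below (10 TFLOP/s peak, 100 GB/s DDR bandwidth) and the matrix sizes are hypothetical placeholders, not measurements from the paper; the sketch also assumes FP16 operands and that each matrix is moved through memory exactly once.

```python
def gemm_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Arithmetic intensity (FLOPs/byte) of an (m x k) @ (k x n) product.

    Assumes each operand and the output cross the memory bus exactly once.
    """
    flops = 2 * m * k * n                                  # one multiply + one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def attainable_tflops(intensity: float, peak_tflops: float, bandwidth_gbs: float) -> float:
    """Roofline bound: min(compute peak, intensity * memory bandwidth)."""
    memory_bound_tflops = intensity * bandwidth_gbs / 1000.0  # GFLOP/s -> TFLOP/s
    return min(peak_tflops, memory_bound_tflops)


# Hypothetical accelerator: 10 TFLOP/s peak, 100 GB/s shared DDR bandwidth.
PEAK_TFLOPS, BANDWIDTH_GBS = 10.0, 100.0

gemv = gemm_intensity(1, 4096, 4096)     # decode-like: a single token row
gemm = gemm_intensity(512, 4096, 4096)   # prefill-like: 512 token rows

for label, ai in (("GEMV", gemv), ("GEMM", gemm)):
    bound = attainable_tflops(ai, PEAK_TFLOPS, BANDWIDTH_GBS)
    print(f"{label}: {ai:.1f} FLOPs/byte, attainable {bound:.2f} TFLOP/s")
```

Under these assumptions the GEMV sits at roughly 1 FLOP/byte, far below the ridge point of 100 FLOPs/byte, so its throughput scales directly with available bandwidth. The GEMM sits well above the ridge point and is capped by compute instead. This is consistent with the Figure 4 caption's observation that memory-bound GEMV kernels suffer more from NPU/iGPU co-execution on shared DDR.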