Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC
Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Haoning Guan, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo
TL;DR
This paper addresses the challenge of running concurrent agentic LLM workloads on commodity heterogeneous SoCs. It introduces Agent.xpu, an engine that orchestrates reactive and proactive LLM flows using a heterogeneous execution graph (HEG) and flow-aware NPU-iGPU coordination with stage elasticity. The system decouples prefill and decode across the NPU and iGPU to mitigate DDR bandwidth contention, and it uses fine-grained preemption and slack-aware piggybacking to keep user interactions responsive without starving background tasks. Empirical results on Intel Core Ultra platforms show $1.2$–$4.9\times$ proactive throughput gains and reactive latency reductions of at least $91\%$, along with lower energy consumption and reduced iGPU utilization compared to baselines. These findings point to a practical path toward efficient, private, on-device personal agents that exploit heterogeneous accelerators through flow-aware scheduling and dynamic kernel binding.
Abstract
Personal LLM agents increasingly combine foreground reactive interactions with background proactive monitoring, forming long-lived, stateful LLM flows that interleave prefill and token-by-token decode. While modern heterogeneous SoCs integrate CPUs, iGPUs, and NPUs to support on-device intelligence, existing LLM engines assume static, single-shot inference and lack mechanisms for flow-level concurrency, prioritization, and efficient accelerator coordination. As a result, commodity SoCs remain poorly matched to the dynamic, mixed-criticality execution patterns of personal agents. This paper presents Agent.xpu, the first LLM engine that orchestrates concurrent reactive and proactive LLM flows on commodity SoCs. Extensive profiling uncovers SoC-specific characteristics that differ from cloud-serving assumptions: operator-accelerator affinity, asymmetric DDR contention, and stage-divergent batching behavior. Agent.xpu introduces three key techniques: a heterogeneous execution graph (HEG) that captures NPU/iGPU affinity and elastic operator binding; flow-aware NPU-iGPU coordination with stage elasticity, which decouples prefill and decode to reduce bandwidth contention and enforce priorities; and fine-grained preemption with slack-aware piggybacking, which guarantees reactive responsiveness without starving proactive work. Across realistic personal-agent workloads, Agent.xpu delivers 1.2–4.9$\times$ proactive throughput and reduces reactive latency by at least 91%, compared with both an industrial iGPU-only serving engine and NPU-iGPU static inference with optimal tensor-partitioning schemes. Agent.xpu also minimizes energy consumption and graphics interference via controlled iGPU usage.
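To make the idea of slack-aware piggybacking concrete, the following toy sketch shows one way a decode batch could admit reactive flows first and then fill any leftover batch slots with proactive flows. It is an illustration only and not the Agent.xpu scheduler: the names `Flow`, `decode_step`, and `batch_slots` are hypothetical, and the sketch ignores prefill, kernel-level preemption, NPU/iGPU placement, and DDR contention.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Flow:
    """Hypothetical stand-in for one LLM flow in its decode stage."""
    name: str
    reactive: bool    # True = foreground (user-facing), False = background
    tokens_left: int  # decode tokens still to generate


def decode_step(reactive_q, proactive_q, batch_slots):
    """Run one simulated decode iteration and return the batched flow names.

    Reactive flows are admitted first. Any remaining batch slots (the
    "slack") are filled with proactive flows, so background work still
    advances whenever the foreground does not need the whole batch.
    """
    batch = []
    for queue in (reactive_q, proactive_q):  # priority order
        while queue and len(batch) < batch_slots:
            batch.append(queue.popleft())

    for flow in batch:
        flow.tokens_left -= 1  # each batched flow emits one token

    # Unfinished flows return to the back of their own queue.
    for flow in batch:
        if flow.tokens_left > 0:
            (reactive_q if flow.reactive else proactive_q).append(flow)

    return [flow.name for flow in batch]


if __name__ == "__main__":
    reactive_q = deque([Flow("chat", True, 2)])
    proactive_q = deque([Flow("mail", False, 3), Flow("cal", False, 1)])
    while reactive_q or proactive_q:
        print(decode_step(reactive_q, proactive_q, batch_slots=2))
```

In this sketch the reactive flow is never delayed by background work, while the proactive flows progress in every iteration that has a free slot. A real engine would also need to bound the latency cost that extra batched flows impose on the reactive flow, which this toy model does not capture.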
