Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC

Xinming Wei; Jiahao Zhang; Haoran Li; Jiayu Chen; Haoning Guan; Rui Qu; Maoliang Li; Xiang Chen; Guojie Luo


TL;DR

This paper tackles the challenge of running concurrent agentic LLM workloads on commodity heterogeneous SoCs by introducing Agent.xpu, an engine that orchestrates reactive and proactive flows using a heterogeneous execution graph (HEG) and flow-aware coordination with stage elasticity. The system decouples prefill and decode across the NPU and iGPU to mitigate DDR bandwidth contention, and it uses fine-grained preemption and slack-aware piggybacking to guarantee responsive user interactions without starving background tasks. Empirical results on Intel Core Ultra platforms show $1.2$–$4.9\times$ proactive throughput gains, reactive latency reductions of at least 91%, and substantial energy savings and lower iGPU utilization compared with baselines. These findings demonstrate a practical path to efficient, private, on-device personal agents that leverage heterogeneous accelerators through flow-aware scheduling and dynamic kernel binding.

Abstract

Personal LLM agents increasingly combine foreground reactive interactions with background proactive monitoring, forming long-lived, stateful LLM flows that interleave prefill and token-by-token decode. While modern heterogeneous SoCs integrate CPUs, iGPUs, and NPUs to support on-device intelligence, existing LLM engines assume static, single-shot inference and lack mechanisms for flow-level concurrency, prioritization, and efficient accelerator coordination. As a result, commodity SoCs remain poorly matched to the dynamic, mixed-criticality execution patterns of personal agents. This paper presents Agent.xpu, the first LLM engine that orchestrates concurrent reactive and proactive LLM flows on commodity SoCs. Extensive profiling uncovers SoC characteristics that differ from cloud-serving assumptions: operator-accelerator affinity, asymmetric DDR contention, and stage-divergent batching behaviors. Agent.xpu introduces three key techniques: a heterogeneous execution graph (HEG) capturing NPU/iGPU affinity and elastic operator binding; flow-aware NPU-iGPU coordination with stage elasticity, decoupling prefill and decode to reduce bandwidth contention and enforce priorities; and fine-grained preemption with slack-aware piggybacking to guarantee reactive responsiveness without starving proactive work. Across realistic personal-agent workloads, Agent.xpu delivers $1.2$–$4.9\times$ proactive throughput and reduces reactive latency by at least 91%, compared with both an industrial iGPU-only serving engine and NPU-iGPU static inference with optimal tensor-partitioning schemes. Agent.xpu also minimizes energy consumption and graphics interference via controlled iGPU usage.
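To make the third technique concrete, the following is a minimal toy sketch of kernel-granularity preemption with slack-aware piggybacking. It is not Agent.xpu's implementation: the `Flow` class, the `schedule` function, the per-kernel costs, the deadlines, and the single-accelerator model are all hypothetical assumptions chosen for illustration. The scheduler re-decides at every kernel boundary, which is where a proactive flow can be preempted, and it lets a proactive kernel run ahead of a pending reactive flow only when the reactive flow's slack (deadline minus current time minus its remaining work) covers that kernel.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Flow:
    """A toy LLM flow: a queue of kernel costs in ms (hypothetical numbers)."""
    name: str
    reactive: bool
    kernels: List[float]             # remaining kernel costs, in ms
    arrival: float = 0.0             # time the flow becomes runnable, in ms
    deadline: float = float("inf")   # absolute deadline; set it for reactive flows


def schedule(flows: List[Flow]) -> List[str]:
    """Run all flows on one accelerator; return the flow name of each kernel run."""
    now, trace = 0.0, []
    while any(f.kernels for f in flows):
        ready = [f for f in flows if f.kernels and f.arrival <= now]
        if not ready:
            # Idle until the next flow arrives.
            now = min(f.arrival for f in flows if f.kernels)
            continue
        reactive = [f for f in ready if f.reactive]
        proactive = [f for f in ready if not f.reactive]
        if reactive:
            # Reactive work has priority (earliest deadline first).
            chosen = min(reactive, key=lambda f: f.deadline)
            if proactive:
                # Slack = time to spare before the reactive deadline.
                slack = chosen.deadline - now - sum(chosen.kernels)
                if slack >= proactive[0].kernels[0]:
                    chosen = proactive[0]  # piggyback one proactive kernel
        else:
            chosen = proactive[0]
        # Decisions are remade after every kernel, so preemption happens
        # at kernel boundaries rather than mid-kernel.
        now += chosen.kernels.pop(0)
        trace.append(chosen.name)
    return trace


if __name__ == "__main__":
    flows = [
        Flow("monitor", reactive=False, kernels=[1, 1, 1, 1]),
        Flow("chat", reactive=True, kernels=[2, 2], arrival=2.0, deadline=7.0),
    ]
    print(schedule(flows))
```

In this toy run the background `monitor` flow gets one extra kernel after `chat` arrives, because the reactive flow still has 1 ms of slack; it is then held back until `chat` completes exactly at its deadline, and finishes afterwards. The real system additionally chooses between NPU and iGPU per stage, which this sketch does not model.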

