HeRo: Adaptive Orchestration of Agentic RAG on Heterogeneous Mobile SoC
Maoliang Li, Jiayu Chen, Zihao Zheng, Ziqian Li, Xinhao Sun, Guojie Luo, Chenchen Liu, Xiang Chen
TL;DR
HeRo is presented, a heterogeneous-aware framework for low-latency agentic RAG on mobile SoCs that combines shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control and reduces end-to-end latency by up to $10.94\times over existing deployment strategies, enabling practical on-device agentic RAG.
Abstract
With the increasing computational capability of mobile devices, deploying agentic retrieval-augmented generation (RAG) locally on heterogeneous System-on-Chips (SoCs) has become a promising way to enhance LLM-based applications. However, agentic RAG induces multi-stage workflows with heterogeneous models and dynamic execution flow, while mobile SoCs exhibit strong accelerator affinity, shape sensitivity, and shared-memory bandwidth contention, making naive scheduling ineffective. We present HeRo, a heterogeneous-aware framework for low-latency agentic RAG on mobile SoCs. HeRo builds profiling-based performance models for each sub-stage and model-PU configuration, capturing latency, workload shape, and contention-induced slowdown, and leverages them in a lightweight online scheduler that combines shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control. Experiments on commercial mobile devices show that HeRo reduces end-to-end latency by up to $10.94\times$ over existing deployment strategies, enabling practical on-device agentic RAG.
