HeRo: Adaptive Orchestration of Agentic RAG on Heterogeneous Mobile SoC

Maoliang Li, Jiayu Chen, Zihao Zheng, Ziqian Li, Xinhao Sun, Guojie Luo, Chenchen Liu, Xiang Chen

TL;DR

HeRo is a heterogeneous-aware framework for low-latency agentic RAG on mobile SoCs. It combines shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control, reducing end-to-end latency by up to $10.94\times$ over existing deployment strategies and enabling practical on-device agentic RAG.

Abstract

With the increasing computational capability of mobile devices, deploying agentic retrieval-augmented generation (RAG) locally on heterogeneous System-on-Chips (SoCs) has become a promising way to enhance LLM-based applications. However, agentic RAG induces multi-stage workflows with heterogeneous models and dynamic execution flow, while mobile SoCs exhibit strong accelerator affinity, shape sensitivity, and shared-memory bandwidth contention, making naive scheduling ineffective. We present HeRo, a heterogeneous-aware framework for low-latency agentic RAG on mobile SoCs. HeRo builds profiling-based performance models for each sub-stage and model-PU configuration, capturing latency, workload shape, and contention-induced slowdown, and leverages them in a lightweight online scheduler that combines shape-aware sub-stage partitioning, criticality-based accelerator mapping, and bandwidth-aware concurrency control. Experiments on commercial mobile devices show that HeRo reduces end-to-end latency by up to $10.94\times$ over existing deployment strategies, enabling practical on-device agentic RAG.
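
To make the idea of a profile-driven scheduler more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a lookup table of profiled latencies per (stage, processing unit) pair and a single contention slowdown factor, then decides whether two independent stages should run concurrently on different units or back to back. The function name `plan_two_stages`, the stage and unit names, and all numbers are hypothetical and chosen only for illustration. Shape-aware partitioning and criticality-based mapping are not modeled here.

```python
from itertools import permutations

# Hypothetical profiled latencies in milliseconds, keyed by (stage, PU).
# Illustrative values only; they are not measurements from the paper.
LATENCY_MS = {
    ("embed", "NPU"): 20.0, ("embed", "GPU"): 35.0,
    ("rerank", "NPU"): 30.0, ("rerank", "GPU"): 40.0,
}


def plan_two_stages(stages, pus, latency, slowdown):
    """Choose between serial and concurrent execution of independent stages.

    `slowdown` is an assumed multiplicative penalty applied to every stage
    when stages co-run and compete for shared memory bandwidth.
    Returns (mode, estimated_latency_ms, stage_to_pu_mapping).
    """
    # Serial option: run stages back to back, each on its fastest PU.
    serial_map = {s: min(pus, key=lambda p: latency[(s, p)]) for s in stages}
    serial_ms = sum(latency[(s, serial_map[s])] for s in stages)

    # Concurrent option: one stage per PU, all latencies inflated by contention.
    best_ms, best_map = float("inf"), None
    for assigned in permutations(pus, len(stages)):
        ms = max(latency[(s, p)] * slowdown for s, p in zip(stages, assigned))
        if ms < best_ms:
            best_ms, best_map = ms, dict(zip(stages, assigned))

    if best_ms < serial_ms:
        return "concurrent", best_ms, best_map
    return "serial", serial_ms, serial_map


if __name__ == "__main__":
    for factor in (1.3, 1.6):
        print(factor, plan_two_stages(["embed", "rerank"], ["NPU", "GPU"],
                                      LATENCY_MS, factor))
```

In this toy setting, a mild slowdown factor makes co-running on two units worthwhile, while a larger one makes serial execution on the faster unit the better choice. That trade-off is what a bandwidth-aware concurrency control would need to weigh.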

Paper Structure

This paper contains 17 sections, 5 equations, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: RAG Application Optimization Stack. Our work is a hardware and graph co-scheduling framework.
  • Figure 2: Stage-Accelerator Affinity and Shape Sensitivity.
  • Figure 3: Contention Slowdown under Various Parallelism.
  • Figure 4: Orchestration Techniques.
  • Figure 5: End-to-End Latency on Qwen3 Model Family. Embed model: Qwen3-Embedding-0.6B, Rerank model: Qwen3-Reranker-0.6B, Search model: Qwen3-1.7B, Chat model: Qwen3-4B. All quantized to INT8.
  • ...and 1 more figure