AME: An Efficient Heterogeneous Agentic Memory Engine for Smartphones

Xinkui Zhao, Qingyu Ma, Yifan Zhang, Hengxuan Lou, Guanjie Cheng, Shuiguang Deng, Jianwei Yin

TL;DR

The paper tackles the challenge of running continuously evolving agentic memory on smartphones, where privacy and latency requirements call for an on-device memory backend that fits mobile hardware. It introduces AME, a co-designed engine that combines a hardware-aware heterogeneous vector computation pipeline with a workload-aware vector index and memory scheduler, built on a unified memory fabric shared by the CPU, GPU, and NPU. Key contributions include an NPU-side data adaptation layer for zero-CPU preprocessing, template-driven heterogeneous execution for queries, updates, and index rebuilds, and a windowed scheduling strategy. Evaluation on Snapdragon 8-series SoCs, using HotpotQA with Llama 3-3B embeddings, shows up to $1.4\times$ higher query throughput at matched recall, up to $7\times$ faster index construction, and up to $6\times$ higher insertion throughput under concurrent query workloads. These results support the viability of SoC-aware co-design for low-latency, privacy-preserving on-device memory systems for personalized agents.

Abstract

On-device agents on smartphones increasingly require continuously evolving memory to support personalized, context-aware, and long-term behaviors. To meet both privacy and responsiveness demands, user data is embedded as vectors and stored in a vector database for fast similarity search. However, most existing vector databases target server-class environments. When ported directly to smartphones, two gaps emerge: (G1) a mismatch between mobile SoC constraints and vector-database assumptions, including tight bandwidth budgets, limited on-chip memory, and stricter data type and layout constraints; and (G2) a workload mismatch, because on-device usage resembles a continuously learning memory, in which queries must coexist with frequent inserts, deletions, and ongoing index maintenance. To address these challenges, we propose AME, an on-device Agentic Memory Engine co-designed with modern smartphone SoCs. AME introduces two key techniques: (1) a hardware-aware, high-efficiency matrix pipeline that maximizes compute-unit utilization and exploits multi-level on-chip storage to sustain high throughput; and (2) a hardware- and workload-aware scheduling scheme that coordinates querying, insertion, and index rebuilding to minimize latency. We implement AME on Snapdragon 8-series SoCs and evaluate it on HotpotQA. In our experiments, AME improves query throughput by up to 1.4x at matched recall, achieves up to 7x faster index construction, and delivers up to 6x higher insertion throughput under concurrent query workloads.
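To make the workload in (G2) concrete, the following is a minimal sketch of a memory store in which similarity queries coexist with inserts and deletions. It is not AME's index, scheduler, or API: the class `TinyVectorMemory`, its method names, and the toy three-dimensional vectors are all hypothetical, and it uses a brute-force scan on the CPU rather than the heterogeneous pipeline the abstract describes.

```python
import heapq
import math


def _normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("zero vector cannot be normalized")
    return [x / norm for x in vector]


class TinyVectorMemory:
    """Hypothetical brute-force vector store (illustration only, not AME)."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}  # item id -> unit-length embedding

    def insert(self, item_id, vector):
        """Add or overwrite one memory entry."""
        if len(vector) != self.dim:
            raise ValueError("dimension mismatch")
        self.vectors[item_id] = _normalize(vector)

    def delete(self, item_id):
        """Remove an entry; deleting a missing id is a no-op."""
        self.vectors.pop(item_id, None)

    def query(self, vector, k):
        """Return the ids of the k stored entries most similar to `vector`."""
        q = _normalize(vector)
        scored = (
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in self.vectors.items()
        )
        return [item_id for _, item_id in heapq.nlargest(k, scored)]


memory = TinyVectorMemory(dim=3)
memory.insert("coffee", [1.0, 0.0, 0.0])
memory.insert("tea", [0.9, 0.1, 0.0])
memory.insert("gym", [0.0, 0.0, 1.0])
print(memory.query([1.0, 0.05, 0.0], k=2))

# Updates interleave with queries: later queries must see the change.
memory.delete("coffee")
print(memory.query([1.0, 0.05, 0.0], k=2))
```

Every query here scans all stored vectors, which stays correct under arbitrary inserts and deletions but scales linearly with memory size. An approximate index avoids the full scan at the cost of maintenance work whenever the data changes, which is the tension the abstract attributes to on-device workloads.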

Paper Structure

This paper contains 23 sections, 9 figures, 1 table.

Figures (9)

  • Figure 1: Motivation for the proposed AME: agentic memory stored in a vector database requires adaptation for efficient use on smartphones.
  • Figure 2: The architecture of the Snapdragon 8 Elite Gen 5.
  • Figure 3: NPU-side data adaptation and execution–transfer overlapping in AME. (a) SMT-based overlap of GEMM execution and DMA transfers with double-buffering in TCM. (b) FP32$\rightarrow$FP16 conversion and tile packing in HVX. (c) In-place transpose and layout conversion without extra DDR traffic. (d) FP16$\rightarrow$FP32 unpacking and conversion.
  • Figure 4: GEMM throughput heatmaps for CPU, GPU, and NPU in our optimized heterogeneous system.
  • Figure 5: Template-driven heterogeneous execution in AME, with four representative templates: query, update, index, and query–update hybrid.
  • ...and 4 more figures