Mobile-Agent-RAG: Driving Smart Multi-Agent Coordination with Contextual Knowledge Empowerment for Long-Horizon Mobile Automation

Yuxiang Zhou, Jichang Li, Yanhao Zhang, Haonan Lu, Guanbin Li

TL;DR

Mobile-Agent-RAG tackles long-horizon, multi-app mobile automation by grounding both planning and execution in external knowledge through two retrieval-augmented components. Manager-RAG provides human-validated high-level task plans to reduce strategic hallucinations, while Operator-RAG supplies precise UI-grounded guidance for atomic actions. Two retrieval-oriented knowledge bases support cross-app planning and app-specific execution, respectively, and Mobile-Eval-RAG offers a challenging benchmark for evaluation. Experiments show substantial improvements in task completion, operator accuracy, and efficiency, demonstrating a robust context-aware paradigm for reliable mobile automation.

Abstract

Mobile agents show immense potential, yet current state-of-the-art (SoTA) agents exhibit inadequate success rates on real-world, long-horizon, cross-application tasks. We attribute this bottleneck to the agents' excessive reliance on static, internal knowledge within MLLMs, which leads to two critical failure points: 1) strategic hallucinations in high-level planning and 2) operational errors during low-level execution on user interfaces (UI). The core insight of this paper is that high-level planning and low-level UI operations require fundamentally distinct types of knowledge. Planning demands high-level, strategy-oriented experiences, whereas operations necessitate low-level, precise instructions closely tied to specific app UIs. Motivated by these insights, we propose Mobile-Agent-RAG, a novel hierarchical multi-agent framework that innovatively integrates dual-level retrieval augmentation. At the planning stage, we introduce Manager-RAG to reduce strategic hallucinations by retrieving human-validated comprehensive task plans that provide high-level guidance. At the execution stage, we develop Operator-RAG to improve execution accuracy by retrieving the most precise low-level guidance for accurate atomic actions, aligned with the current app and subtask. To accurately deliver these knowledge types, we construct two specialized retrieval-oriented knowledge bases. Furthermore, we introduce Mobile-Eval-RAG, a challenging benchmark for evaluating such agents on realistic multi-app, long-horizon tasks. Extensive experiments demonstrate that Mobile-Agent-RAG significantly outperforms SoTA baselines, improving task completion rate by 11.0% and step efficiency by 10.2%, establishing a robust paradigm for context-aware, reliable multi-agent mobile automation.
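The dual-level retrieval idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the retriever, so a toy bag-of-words cosine similarity stands in for a real text encoder. All names here (`PlanEntry`, `ActionEntry`, `retrieve_plans`, `retrieve_action`) and all knowledge-base entries are hypothetical. The sketch only shows the separation the abstract argues for: a planning-level lookup keyed on the whole task, and an execution-level lookup restricted to the current app and keyed on the subtask.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from dataclasses import dataclass


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; an assumed stand-in for a real encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class PlanEntry:
    """Hypothetical planning-level record: a task and a validated plan."""
    task: str
    plan: list[str]


@dataclass
class ActionEntry:
    """Hypothetical execution-level record: app-specific UI guidance."""
    app: str
    subtask: str
    guidance: str


def retrieve_plans(kb: list[PlanEntry], task: str, k: int = 1) -> list[PlanEntry]:
    """Planning-level retrieval: rank stored tasks by similarity to the query."""
    q = embed(task)
    return sorted(kb, key=lambda e: cosine(q, embed(e.task)), reverse=True)[:k]


def retrieve_action(kb: list[ActionEntry], app: str, subtask: str) -> ActionEntry | None:
    """Execution-level retrieval: filter by current app, then match the subtask."""
    candidates = [e for e in kb if e.app == app]
    if not candidates:
        return None
    q = embed(subtask)
    return max(candidates, key=lambda e: cosine(q, embed(e.subtask)))


# Illustrative, made-up knowledge-base contents.
MANAGER_KB = [
    PlanEntry(
        "Find a nearby restaurant in Maps and save its address in Notes",
        ["Open Maps", "Search for restaurants", "Copy the address",
         "Open Notes and paste the address"],
    ),
    PlanEntry(
        "Compare prices of a laptop on two shopping apps",
        ["Open the first shopping app", "Search for the laptop",
         "Open the second shopping app", "Search again and compare"],
    ),
]

OPERATOR_KB = [
    ActionEntry("Maps", "search for a place", "Tap the search bar, then type the query"),
    ActionEntry("Notes", "create a new note", "Tap the plus button to create a note"),
    ActionEntry("Notes", "search for a note", "Tap the magnifier icon"),
]

if __name__ == "__main__":
    best = retrieve_plans(MANAGER_KB, "Look up a restaurant in Maps and write the address into Notes")[0]
    print(best.plan)
    hit = retrieve_action(OPERATOR_KB, "Notes", "create a note with the address")
    print(hit.guidance if hit else "no guidance for this app")
```

Filtering by app before ranking reflects the abstract's statement that operator guidance is aligned with the current app and subtask; a real system would presumably also condition on the current screen, which this sketch omits.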

Paper Structure

This paper contains 52 sections, 9 figures, 17 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Mobile-Agent-RAG is a novel hierarchical multi-agent framework explicitly designed for long-horizon, multi-app mobile automation tasks. Mobile-Agent-RAG's proposed Manager-RAG retrieves human-verified task demonstrations to guide high-level plans, while Operator-RAG retrieves UI-grounded examples for atomic action generation.
  • Figure 2: Framework overview of Mobile-Agent-RAG: A hierarchical multi-agent system empowered by dedicated knowledge providers, namely Manager-RAG and Operator-RAG, for enhanced high-level planning and precise atomic action generation. Its core operational loop is illustrated: Manager plans, Operator executes, with Perceptor, Action Reflector, and Notetaker providing comprehensive support for mobile task automation.
  • Figure 3: The knowledge base collection flow for Manager-RAG and Operator-RAG.
  • Figure 4: Comparison of performance gain sources for Mobile-Agent-RAG, Mobile-Agent-E, and Mobile-Agent-E+Evo across different MLLMs.
  • Figure 5: Ablation study results of Mobile-Agent-RAG, with (a) showing a radar chart comparing evaluation metrics across different ablation variants, and (b) presenting CR variations over 10 trials for each variant.
  • ...and 4 more figures