MobileA3gent: Training Mobile GUI Agents Using Decentralized Self-Sourced Data from Diverse Users

Wenhao Wang, Mengying Yuan, Zijie Yu, Guangyi Liu, Rui Ye, Tian Jin, Siheng Chen, Yanfeng Wang

TL;DR

MobileA3gent tackles scalable training of mobile GUI agents by shifting from centralized data collection to distributed, self-sourced user data and replacing human annotation with Auto-Annotation. It couples Auto-Annotation with FedVLM-A, a federated training framework that uses Adapted Global Aggregation to handle two-level heterogeneity in episode- and step-level distributions, all while preserving user privacy. Across four benchmarks and 10+ models, the approach achieves near-centralized performance at approximately 1% of the cost and demonstrates strong data quality, generalization, and robustness under non-IID settings. The work offers a practical, privacy-preserving pathway to deploy capable mobile agents at scale.

Abstract

The advancement of mobile GUI agents has opened new opportunities for automating tasks on mobile devices. Training these agents requires large-scale high-quality data, which is prohibitively expensive when relying on human labor. Given the vast population of global mobile phone users, if automated data collection from them becomes feasible, the resulting data volume and the subsequently trained mobile agents could reach unprecedented levels. Nevertheless, two major challenges arise: (1) extracting user instructions without human intervention and (2) utilizing distributed user data while preserving privacy. To tackle these challenges, we propose MobileA3gent, a collaborative framework that trains mobile GUI Agents using decentralized self-sourced data from diverse users. The framework comprises two components, each targeting a specific challenge: (1) Auto-Annotation, which enables the automatic collection of high-quality datasets during users' routine phone usage at minimal cost; and (2) FedVLM-A, which enhances federated VLM training under non-IID distributions by incorporating adapted global aggregation based on both episode-level and step-level variability. Extensive experiments prove that MobileA3gent achieves superior performance over traditional approaches at only 1% of the cost, highlighting its potential for real-world applications.
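
The abstract describes adapted global aggregation as weighting client updates by both episode-level and step-level variability, but this summary does not give the formula. The sketch below is only an illustration under stated assumptions: each client's weight is a blend of its share of episodes and its share of steps, and the server takes a weighted average of client parameters in the style of FedAvg. The function names, the blending coefficient `lam`, and the blending rule itself are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def aggregation_weights(
    episodes: List[int], steps: List[int], lam: float = 0.5
) -> List[float]:
    """Blend each client's episode share and step share into one weight.

    `lam` is a hypothetical mixing coefficient (an assumption, not from the
    paper): lam=1 weights by episode count only, lam=0 by step count only.
    """
    total_e, total_s = sum(episodes), sum(steps)
    return [
        lam * e / total_e + (1.0 - lam) * s / total_s
        for e, s in zip(episodes, steps)
    ]


def aggregate(
    client_params: List[Dict[str, List[float]]], weights: List[float]
) -> Dict[str, List[float]]:
    """Weighted average of per-client parameter vectors (FedAvg-style)."""
    merged: Dict[str, List[float]] = {}
    for name in client_params[0]:
        dim = len(client_params[0][name])
        merged[name] = [
            sum(w * p[name][i] for w, p in zip(weights, client_params))
            for i in range(dim)
        ]
    return merged


# Two clients: the second holds more episodes, both hold the same step count.
weights = aggregation_weights(episodes=[10, 30], steps=[100, 100])
global_params = aggregate(
    [{"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}],
    weights,
)
```

Because both shares sum to one across clients, the blended weights also sum to one for any `lam` between 0 and 1, so the aggregate stays a convex combination of the client parameters.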

Paper Structure

This paper contains 42 sections, 9 equations, 18 figures, 12 tables.

Figures (18)

  • Figure 1: Comparing our proposed paradigm with conventional ones. By leveraging users' daily phone usage, we achieve superior scalability with drastic cost savings.
  • Figure 2: System overview of MobileA3gent. During individual users' daily phone usage, Auto-Annotation automatically constructs training data through step-wise description and episode-wide summarization. Each user then participates in FedVLM-A through our training integration. By applying adapted global aggregation, we obtain the target mobile agent with enhanced capabilities.
  • Figure 3: Performance and annotation cost trade-off on AndroidWorld.
  • Figure 4: Data quality evaluation across comprehensive metrics. Auto-Annotation outperforms all other baselines and achieves quality comparable to Human-Annotation, with nearly 80% similarity.
  • Figure 5: Comparison between FedVLM-A and 7 baselines on non-IID splits of AndroidControl. FedVLM-A achieves SOTA performance on average. Transparent bars indicate average scores over skewed scenarios only.
  • ...and 13 more figures