FedMABench: Benchmarking Mobile Agents on Decentralized Heterogeneous User Data

Wenhao Wang, Zijie Yu, Rui Ye, Jianqing Zhang, Siheng Chen, Yanfeng Wang

TL;DR

This paper tackles the lack of standardized benchmarks for federated mobile agents trained on decentralized heterogeneous user data. It introduces FedMABench, a comprehensive benchmark with 6 datasets, 30+ subsets, 8 federated learning (FL) algorithms, and 10+ base models, covering 877 apps in 5 categories, plus an end-to-end framework for training and evaluation. Through extensive experiments, it shows that federated approaches outperform local training, reveals that heterogeneity, especially in the distribution of app names, significantly shapes performance, and uncovers cross-category correlations that can influence learning dynamics. The work provides publicly accessible resources to support fair comparisons and identifies open challenges, emphasizing the need for novel FL algorithms and privacy-preserving strategies tailored to mobile-agent training on real-world user data.

Abstract

Mobile agents have recently attracted tremendous research attention. Traditional approaches to mobile agent training rely on centralized data collection, leading to high cost and limited scalability. Distributed training with federated learning offers an alternative that harnesses real-world user data, improving scalability and reducing costs. However, pivotal challenges, including the absence of standardized benchmarks, hinder progress in this field. To address these challenges, we introduce FedMABench, the first benchmark for federated training and evaluation of mobile agents, specifically designed for heterogeneous scenarios. FedMABench features 6 datasets with 30+ subsets, 8 federated algorithms, 10+ base models, and over 800 apps across 5 categories, providing a comprehensive framework for evaluating mobile agents across diverse environments. Through extensive experiments, we uncover several key insights: federated algorithms consistently outperform local training; the distribution of specific apps plays a crucial role in heterogeneity; and even apps from distinct categories can exhibit correlations during training. FedMABench is publicly available at: https://github.com/wwh0411/FedMABench with the datasets at: https://huggingface.co/datasets/wwh0411/FedMABench.
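
To make the contrast between federated and local training concrete, here is a minimal sketch of one server-side aggregation round in the style of FedAvg, a widely used federated learning baseline. The abstract does not name the 8 algorithms in FedMABench, so treating FedAvg as representative is an assumption; the function name `fedavg` and the flat dict-of-lists parameter layout are hypothetical and are not taken from the FedMABench codebase.

```python
from typing import Dict, List

# Hypothetical parameter layout: parameter name -> flat list of floats.
Params = Dict[str, List[float]]


def fedavg(client_params: List[Params], client_sizes: List[int]) -> Params:
    """Average client parameters, weighting each client by its sample count."""
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    aggregated: Params = {}
    for name in client_params[0]:
        length = len(client_params[0][name])
        # Weighted mean of each coordinate across all clients.
        aggregated[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(length)
        ]
    return aggregated


# Two clients with unequal data volumes (e.g. different episode counts).
clients = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
sizes = [1, 3]
global_params = fedavg(clients, sizes)
print(global_params)  # {'w': [2.5, 3.5]}
```

In local training, each client would keep only its own parameters. Here the client with three times as much data pulls the shared model three times as strongly, which is one place where heterogeneous data volumes across users can affect the result.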

Paper Structure

This paper contains 35 sections, 6 figures, 15 tables.

Figures (6)

  • Figure 1: Overview of FedMABench. FedMABench is tailored for benchmarking federated mobile agents trained on distributed mobile user data with diverse types of heterogeneity. To achieve this, we construct 2 homogeneous datasets and 4 heterogeneous datasets with 30+ subsets. We also build a research-friendly framework, which integrates 8 representative federated algorithms and supports evaluation on more than 10 base models. Our datasets cover 877 apps across 5 categories (bottom right), and the experiments (upper right) show that (1) federated mobile agents achieve promising results, surpassing GPT-4o by a large margin; (2) our datasets can reveal the performance differences of mobile agents on different distributions.
  • Figure 2: Distributions of episode and step counts within the Step-Episode Dataset. The four subsets highlight distinct differences in average steps per episode across clients. Note that the y-axes are not on the same scale for visualization purposes.
  • Figure 3: Distributions of the top 10 apps across five clients in the Category-Level Dataset. The top two apps from each of the five categories are selected. Our six subsets exhibit diverse patterns in terms of the apps and categories assigned to each client.
  • Figure 4: Distributions of the five apps across the App-Level Dataset. Our subsets reveal distinct differences in the heterogeneity of app usage. Note that the numbers represent episode counts, and the episodes are identical for all subsets.
  • Figure 5: Heatmap distribution of the ScaleApp Dataset. We select the top 15 apps for visualization.
  • ...and 1 more figure