Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents

Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, Shuo Shang

TL;DR

This paper introduces Mobile-Bench, a benchmark and platform for evaluating LLM-based mobile agents using hybrid UI and API interactions. It expands task support with 103 APIs across 29 real apps and categorizes data into SAST, SAMT, and MAMT to probe planning and multi-app coordination, complemented by a CheckPoint-based evaluation metric. The experimental results reveal the impact of API usage and planning on task success and efficiency, and identify limitations such as API hallucinations and challenges in multi-app tasks. Overall, Mobile-Bench offers a realistic, scalable framework to advance the development and evaluation of mobile agents driven by large language models.

Abstract

With the remarkable advances in large language models (LLMs), LLM-based agents have become a research hotspot in human-computer interaction. However, benchmarks for LLM-based mobile agents remain scarce. Benchmarking these agents generally faces three main challenges: (1) The inefficiency of UI-only operations limits task evaluation. (2) Specific instructions within a single application are inadequate for assessing the multi-dimensional reasoning and decision-making capacities of LLM mobile agents. (3) Current evaluation metrics cannot accurately assess the process of sequential actions. To this end, we propose Mobile-Bench, a novel benchmark for evaluating the capabilities of LLM-based mobile agents. First, we expand conventional UI operations by incorporating 103 collected APIs to make task completion more efficient. Subsequently, we collect evaluation data by combining real user queries with augmentation from LLMs. To better evaluate different levels of planning capabilities for mobile agents, our data is categorized into three distinct groups: SAST, SAMT, and MAMT, reflecting varying levels of task complexity. Mobile-Bench comprises 832 data entries, with more than 200 tasks specifically designed to evaluate multi-app collaboration scenarios. Furthermore, we introduce a more accurate evaluation metric, named CheckPoint, to assess whether LLM-based mobile agents reach essential points during their planning and reasoning steps.

Paper Structure

This paper contains 40 sections, 4 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: For the task of "Setting an alarm for seven thirty.", accomplishing it solely through UI operations requires four steps, while API calls can achieve the same task in just one step.
  • Figure 2: A test case in MAMT. $\&$ stands for conjunction check (CC); $|$ stands for disjunction check (DC); $[ \ ]$ stands for sequential check (SC). The package CheckPoint passes when the action history includes either Amap and Ctrip Travel, or Amap and Qunar. The key phrase CheckPoints come from the orange parts of the case.
  • Figure 3: (a) The proportions of API$\&$UI tasks and UI-only tasks. In SAST and SAMT, API$\&$UI tasks account for $85\%$; in MAMT, they account for $100\%$. (b) The number of CheckPoints.
  • Figure 4: Test Platform Overview. The test platform is linked by the user, the simulator, and the Agent. After the user's instructions are issued, the entire test execution process is completed by the Agent, which can view and manage the test tasks through the preset interface in the cloud.
  • Figure 5: Baseline Model Overview. The entire process framework consists of sensors, reflection components, controllers, execution components, and environments. Once a task starts, these components will run iteratively until the task is completed or the maximum number of steps is reached.
  • ...and 6 more figures
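
The three CheckPoint operators named in the Figure 2 caption (conjunction, disjunction, and sequential checks) can be illustrated with a minimal Python sketch. This is not the benchmark's implementation: the function names, the representation of the action history as a list of strings, and the reading of a sequential check as "items appear in this order, not necessarily adjacent" are all assumptions made for illustration.

```python
from typing import Iterable, Sequence


def conjunction_check(history: Sequence[str], required: Iterable[str]) -> bool:
    """CC (&): every required item appears somewhere in the history."""
    return all(item in history for item in required)


def disjunction_check(history: Sequence[str], options: Iterable[str]) -> bool:
    """DC (|): at least one of the options appears in the history."""
    return any(item in history for item in options)


def sequential_check(history: Sequence[str], ordered: Iterable[str]) -> bool:
    """SC ([ ]): items appear in the given order (assumed: gaps allowed)."""
    remaining = iter(history)
    # `item in remaining` advances the iterator past the match, so each
    # later item must be found after the position of the previous one.
    return all(item in remaining for item in ordered)


def package_checkpoint(history: Sequence[str]) -> bool:
    """Hypothetical encoding of the Figure 2 package CheckPoint:
    Amap & (Ctrip Travel | Qunar)."""
    return conjunction_check(history, ["Amap"]) and disjunction_check(
        history, ["Ctrip Travel", "Qunar"]
    )


if __name__ == "__main__":
    print(package_checkpoint(["Amap", "Qunar"]))         # Amap plus one travel app
    print(package_checkpoint(["Ctrip Travel", "Qunar"]))  # Amap is missing
    print(sequential_check(["open", "search", "book"], ["open", "book"]))
```

Under these assumptions, an action history of Amap followed by Qunar passes the package CheckPoint, while a history that never opens Amap fails it, which matches the behaviour the caption describes.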