SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao

TL;DR

SPA-Bench introduces a comprehensive benchmark for evaluating multimodal smartphone agents across English and Chinese apps and both single-app and cross-app tasks. It combines a large, diverse task suite with a plug-and-play agent framework and an automated, scalable evaluation pipeline utilizing seven metrics and hybrid success detection. Experimental results show that agentic workflow agents generally outperform agent-as-a-model approaches, but cross-app tasks, Chinese-language UI complexity, and real-world deployment constraints (time and cost) remain significant challenges. The work outlines concrete directions in UI grounding, memory-augmented reasoning, robust error handling, and efficiency improvements to accelerate the development of practical, user-friendly smartphone agents.

Abstract

Smartphone agents are increasingly important for helping users control devices efficiently, with (Multimodal) Large Language Model ((M)LLM)-based approaches emerging as key contenders. Fairly comparing these agents is essential but challenging, requiring a varied task scope, the integration of agents with different implementations, and a generalisable evaluation pipeline to assess their strengths and weaknesses. In this paper, we present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents in an interactive environment that simulates real-world conditions. SPA-Bench offers three key contributions: (1) A diverse set of tasks covering system and third-party apps in both English and Chinese, focusing on features commonly used in daily routines; (2) A plug-and-play framework enabling real-time agent interaction with Android devices, integrating over ten agents with the flexibility to add more; (3) A novel evaluation pipeline that automatically assesses agent performance across multiple dimensions, encompassing seven metrics related to task completion and resource consumption. Our extensive experiments across tasks and agents reveal challenges like interpreting mobile user interfaces, action grounding, memory retention, and execution costs. We propose future research directions to ease these difficulties, moving closer to real-world smartphone agent applications. SPA-Bench is available at https://ai-agents-2030.github.io/SPA-Bench/.

Paper Structure

This paper contains 65 sections, 14 figures, 11 tables.

Figures (14)

  • Figure 1: An overview of SPA-Bench. The worker machine iterates through the task and agent pools, assigning tasks to agents within the framework for execution, and then passes the execution results to the evaluation pipeline for measuring task completion and resource consumption performance.
  • Figure 2: A sample set of tasks within the Deliveroo app, annotated by humans. In this example, simpler tasks form the foundation for more complex ones, resulting in shared trajectories in the initial stages. The final screenshots for tasks of all three difficulty levels are highlighted in corresponding colours. Each final screenshot highlights the key components used in coarse detection (explained further in the automated evaluation section), with zoomed-in versions available in the appendix.
  • Figure 3: An overview of the agent framework using a multi-processing architecture. Each worker process connects an agent to an Android emulator, and they interact multiple times throughout the task (i.e., step 3 is repeated) until completion. The emulators are reset after the agent has executed all assigned tasks.
  • Figure 4: An example of our single-app success detection pipeline. It features coarse detection, which matches execution screenshots against pre-annotated key components, followed by fine detection using MLLM evaluation given action information.
  • Figure 5: An example of our cross-app success detection pipeline that is based on subtasks instead of the entire task. The first stage involves splitting the full trajectory into segments, while the second stage checks the subtasks sequentially.
  • ...and 9 more figures
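
The coarse-to-fine detection described for Figure 4 and the sequential subtask check described for Figure 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation. All names (`Step`, `coarse_match`, `detect_single_app_success`, `detect_cross_app_success`) are hypothetical. Screenshots are reduced to already-extracted text, key component matching is simplified to case-insensitive substring matching, the MLLM evaluator is replaced by a caller-supplied function, the trajectory is assumed to be split into segments beforehand, and stopping at the first failed subtask is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    """One executed step (hypothetical structure)."""
    screen_text: str  # assumption: text already extracted from the screenshot
    action: str       # assumption: textual description of the action taken


# Stand-in for an MLLM evaluator: takes a trajectory, returns a verdict.
Judge = Callable[[Sequence[Step]], bool]


def coarse_match(steps: Sequence[Step], key_components: Sequence[str]) -> List[int]:
    """Coarse stage: indices of steps whose screen text contains every key component."""
    return [
        i for i, step in enumerate(steps)
        if all(kc.lower() in step.screen_text.lower() for kc in key_components)
    ]


def detect_single_app_success(
    steps: Sequence[Step],
    key_components: Sequence[str],
    fine_judge: Judge,
) -> bool:
    """Coarse filter first; call the (expensive) judge only on surviving candidates."""
    candidates = coarse_match(steps, key_components)
    if not candidates:
        return False  # no screenshot matched, so the fine stage is skipped
    # Fine stage: judge the trajectory up to and including each candidate step.
    return any(fine_judge(steps[: i + 1]) for i in candidates)


def detect_cross_app_success(
    segments: Sequence[Sequence[Step]],
    subtask_judges: Sequence[Judge],
) -> bool:
    """Check one subtask per pre-split segment, in order, stopping at the first failure."""
    if len(segments) != len(subtask_judges):
        return False
    for segment, judge in zip(segments, subtask_judges):
        if not judge(segment):
            return False
    return True


if __name__ == "__main__":
    trajectory = [
        Step("Home", "open app"),
        Step("Basket: Pizza x1", "tap add"),
    ]
    # The lambda stands in for an MLLM call; here it accepts any candidate.
    print(detect_single_app_success(trajectory, ["pizza", "basket"], lambda t: True))
```

The point of the two-stage design in this sketch is that the cheap coarse filter rejects clearly failed runs and narrows the candidates, so the costlier judge runs only where a match is plausible.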