PSPA-Bench: A Personalized Benchmark for Smartphone GUI Agent

Hongyi Nie, Xunyuan Liu, Yudong Bai, Yaqing Wang, Yang Liu, Quanming Yao, Zhen Wang

Abstract

Smartphone GUI agents execute tasks by operating directly on app interfaces, offering a path to broad capability without deep system integration. However, real-world smartphone use is highly personalized: users adopt diverse workflows and preferences, challenging agents to deliver customized assistance rather than generic solutions. Existing GUI agent benchmarks cannot adequately capture this personalization dimension due to sparse user-specific data and the lack of fine-grained evaluation metrics. To address this gap, we present PSPA-Bench, a benchmark dedicated to evaluating personalization in smartphone GUI agents. PSPA-Bench comprises 12,855 personalized instructions aligned with real-world user behaviors across 10 representative daily-use scenarios and 22 mobile apps, and introduces a structure-aware process evaluation method that measures agents' personalized capabilities at a fine-grained level. Through PSPA-Bench, we benchmark 11 state-of-the-art GUI agents. Results reveal that current methods perform poorly under personalized settings, with even the strongest agent achieving only limited success. Our analysis further highlights three directions for advancing personalized GUI agents: (1) reasoning-oriented models consistently outperform general LLMs, (2) perception remains a simple yet critical capability, and (3) reflection and long-term memory mechanisms are key to improving adaptation. Together, these findings establish PSPA-Bench as a foundation for systematic study and future progress in personalized GUI agents.

Paper Structure

This paper contains 41 sections, 7 equations, 7 figures, 10 tables, and 1 algorithm.

Figures (7)

  • Figure 1: GUI agents are shifting from general-purpose, one-time task execution to personalized, long-term service. (i) Given the same instruction, an agent should adapt its execution path to user preferences; (ii) For the same user, the agent uses historical execution experience to provide long-term support; experience up to time $t$ informs a refined execution at $t+1$ for similar instructions.
  • Figure 2: Benchmark framework of PSPA-Bench. (a) Task decomposition graph: the task is decomposed into a directed acyclic graph of unit instructions, where fixed nodes denote universal steps and flexible nodes denote user-specific requirements. (b) Template-driven personalized instruction generation: the TDG is used to construct task templates, which are then instantiated with user preferences to generate personalized instructions. (c) Fine-grained personalized evaluation: trace-graph alignment is used to compute APR (A-Progress Ratio) and PPR (P-Progress Ratio) for fine-grained evaluation of both immediate and long-term performance. (An illustrative code sketch of these structures follows this figure list.)
  • Figure 3: Evaluation metrics taxonomy of PSPA-Bench.
  • Figure 4: Radar plots showing the comparative ranks of different methods under two objectives. Left: Immediate objective, evaluated with APR, PPR, CT, and CPT. Right: Long-term objective, evaluated with $\Delta$APR, $\Delta$PPR, $\Delta$CT, and $\Delta$CPT. Each axis indicates the rank of a method on the corresponding metric, with lower values representing better performance.
  • Figure 5: The task decomposition graphs of 4 personalized scenarios: shopping, dining, navigation, and travel. Green nodes denote the fixed type, while blue nodes denote the personalized type.
  • ...and 2 more figures
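
The Figure 2 caption describes the two structures at the core of the benchmark: a task decomposition graph (TDG) of unit instructions, and a trace-graph alignment that yields the APR and PPR metrics. The sketch below is a minimal, illustrative model of those ideas, assuming APR covers all unit steps while PPR isolates only the flexible (personalized) ones; every name, the example task, and the simple set-intersection alignment are hypothetical stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class UnitNode:
    node_id: str
    kind: str                    # "fixed" (universal step) or "flexible" (user-specific)
    template: str                # unit instruction with optional preference slots
    successors: List[str] = field(default_factory=list)  # DAG edges

@dataclass
class TaskDecompositionGraph:
    nodes: Dict[str, UnitNode]

    def instantiate(self, preferences: Dict[str, str]) -> List[str]:
        """Fill each node's template with user-preference values to produce
        one personalized unit instruction per step (template-driven generation)."""
        return [n.template.format(**preferences) for n in self.nodes.values()]

def progress_ratio(trace: Set[str], tdg: TaskDecompositionGraph, kinds: Set[str]) -> float:
    """Toy trace-graph alignment: the fraction of nodes of the given kinds that
    the agent's execution trace covered. APR-style: kinds = {"fixed", "flexible"};
    PPR-style: kinds = {"flexible"}. The paper's structure-aware alignment is
    necessarily more elaborate than this set intersection."""
    relevant = [n for n in tdg.nodes.values() if n.kind in kinds]
    covered = sum(1 for n in relevant if n.node_id in trace)
    return covered / len(relevant) if relevant else 1.0

# Hypothetical three-step travel task: two fixed steps and one flexible step.
tdg = TaskDecompositionGraph(nodes={
    "open":   UnitNode("open",   "fixed",    "open the {app} app", ["search"]),
    "search": UnitNode("search", "fixed",    "search flights to {city}", ["seat"]),
    "seat":   UnitNode("seat",   "flexible", "pick a {seat_pref} seat", []),
})
print(tdg.instantiate({"app": "TravelMate", "city": "Osaka", "seat_pref": "window"}))

trace = {"open", "search"}  # the agent completed only the fixed steps
apr = progress_ratio(trace, tdg, {"fixed", "flexible"})  # 2/3: overall progress
ppr = progress_ratio(trace, tdg, {"flexible"})           # 0/1: personalized progress
print(f"APR-like = {apr:.2f}, PPR-like = {ppr:.2f}")
```

Under this reading, the Δ-variants in Figure 4 (ΔAPR, ΔPPR) would compare such ratios between an early execution and a later one on similar instructions, quantifying long-term adaptation.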

Theorems & Definitions (1)

  • Definition 2.1