MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, Zhengxi Lu, Gao Wu, Hao Wang, Liang Liu, Yong Liu
TL;DR
MemGUI-Bench introduces a memory-centric benchmark suite and evaluation framework for mobile GUI agents operating in dynamic environments. It differentiates short-term and long-term memory, and combines a 128-task suite across 26 apps with 64 mirror-task pairs to probe cross-task learning. The MemGUI-Eval arbiter uses Progressive Scrutiny and seven hierarchical metrics to quantify memory fidelity, learning effectiveness, and execution efficiency, supported by a snapshot-based plug-and-play environment for scalable, multi-attempt evaluation. Across 11 state-of-the-art agents, the study reveals pervasive memory deficits, identifies five failure modes, and offers five design implications to guide future memory-enhanced GUI architectures, with all resources open-source.
Abstract
Current mobile GUI agent benchmarks systematically fail to assess memory capabilities: only 5.2-11.8% of their tasks are memory-related, and none evaluates cross-session learning. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications, 89.8% of which challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) an RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.
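The abstract names pass@k as an evaluation measure but does not define it here. As a minimal sketch, assuming pass@k means the fraction of tasks an agent solves in at least one of its first k attempts, the computation could look like the following. The function name `pass_at_k`, the task identifiers, and the outcome data are hypothetical and are not taken from the MemGUI-Bench code.

```python
from typing import Dict, List


def pass_at_k(attempts: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved in at least one of the first k attempts.

    Assumed definition for illustration; the benchmark's own formula may differ.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not attempts:
        raise ValueError("no tasks to score")
    # A task counts as passed if any of its first k attempts succeeded.
    solved = sum(1 for outcomes in attempts.values() if any(outcomes[:k]))
    return solved / len(attempts)


# Hypothetical outcomes: task id -> success flag per attempt, in order.
results = {
    "task_a": [True, False, False],
    "task_b": [False, True, True],
    "task_c": [False, False, False],
    "task_d": [False, False, True],
}

for k in (1, 2, 3):
    print(f"pass@{k} = {pass_at_k(results, k):.2f}")
```

Under this assumed definition, the gap between pass@1 and pass@k for larger k is one way to read how much an agent benefits from repeated attempts at the same task.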
