MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments

Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, Zhengxi Lu, Gao Wu, Hao Wang, Liang Liu, Yong Liu

TL;DR

MemGUI-Bench introduces a memory-centric benchmark suite and evaluation framework for mobile GUI agents operating in dynamic environments. It differentiates short-term and long-term memory, and combines a 128-task suite across 26 apps with 64 mirror-task pairs to probe cross-task learning. The MemGUI-Eval arbiter uses Progressive Scrutiny and seven hierarchical metrics to quantify memory fidelity, learning effectiveness, and execution efficiency, supported by a snapshot-based plug-and-play environment for scalable, multi-attempt evaluation. Across 11 state-of-the-art agents, the study reveals pervasive memory deficits, identifies five failure modes, and offers five design implications to guide future memory-enhanced GUI architectures, with all resources open-source.

Abstract

Current mobile GUI agent benchmarks systematically fail to assess memory capabilities, with only 5.2-11.8% memory-related tasks and no cross-session learning evaluation. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications where 89.8% challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources, including code, benchmark, and evaluation results, will be fully open-sourced and continuously maintained at https://lgy0404.github.io/MemGUI-Bench/.

Paper Structure (85 sections, 8 equations, 22 figures, 13 tables)

Figures (22)

  • Figure 1: An overview of MemGUI-Bench, the first comprehensive benchmark for GUI agent memory evaluation.
  • Figure 2: Statistical overview of the MemGUI-Bench task suite.
  • Figure 3: The unified architecture of MemGUI-Bench's snapshot-based plug-and-play framework.
  • Figure 4: MemGUI-Eval's three-stage progressive scrutiny pipeline.
  • Figure 5: Performance comparison between MemGUI-Bench (89.8% memory-intensive) and AndroidWorld (5.2% memory-intensive). Red annotations show performance drops on memory-intensive tasks.
  • ...and 17 more figures