AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Yibo Shi; Jungang Li; Linghao Zhang; Zihao Dongfang; Biao Wu; Sicheng Tao; Yibo Yan; Chenxi Qin; Weiting Liu; Zhixin Lin; Hanqian Li; Yu Huang; Song Dai; Yonghua Hei; Yue Ding; Xiang Li; Shikang Wang; Chengdong Xu; Jingqi Liu; Xueying Ma; Zhiwen Zheng; Xiaofei Zhang; Bincheng Wang; Nichen Yang; Jie Wu; Lihua Tian; Chen Li; Xuming Hu

Abstract

Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at https://github.com/CVC2233/AndroTMem.
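
To make the idea of "causally linked intermediate-state anchors" with "subgoal-targeted retrieval" more concrete, the following is a minimal sketch and not the authors' implementation. The `Anchor` schema, the `AnchoredMemory` class, its method names, and the example task data are all hypothetical and chosen only for illustration. The abstract does not specify how ASM stores or retrieves anchors.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Anchor:
    """One intermediate-state anchor (hypothetical schema, not from the paper)."""
    step: int        # interaction step at which the state was observed
    subgoal: str     # subgoal this state belongs to
    content: str     # the state itself, e.g. a value read from the screen
    depends_on: list[int] = field(default_factory=list)  # steps of causal parents


class AnchoredMemory:
    """Stores anchors by step and retrieves them per subgoal (illustrative only)."""

    def __init__(self) -> None:
        self._anchors: dict[int, Anchor] = {}

    def add(self, anchor: Anchor) -> None:
        self._anchors[anchor.step] = anchor

    def retrieve(self, subgoal: str) -> list[Anchor]:
        """Return anchors for `subgoal` plus their causal ancestors, in step order."""
        stack = [a.step for a in self._anchors.values() if a.subgoal == subgoal]
        seen: set[int] = set()
        while stack:
            step = stack.pop()
            if step in seen or step not in self._anchors:
                continue
            seen.add(step)
            # Follow causal links back to the states this anchor relied on.
            stack.extend(self._anchors[step].depends_on)
        return [self._anchors[s] for s in sorted(seen)]


# Made-up cross-app task: a taxi booking that depends on an earlier flight lookup.
memory = AnchoredMemory()
memory.add(Anchor(3, "find_flight", "flight departs 09:40"))
memory.add(Anchor(11, "check_weather", "rain expected"))
memory.add(Anchor(20, "book_taxi", "pickup at 07:30", depends_on=[3]))

taxi_steps = [a.step for a in memory.retrieve("book_taxi")]
weather_steps = [a.step for a in memory.retrieve("check_weather")]
```

In this sketch, retrieving for `book_taxi` returns the anchor at step 20 together with the step-3 anchor it depends on, while the unrelated weather anchor is left out. Full-sequence replay, by contrast, would pass all steps to the agent regardless of relevance.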

Paper Structure (59 sections, 6 equations, 8 figures, 10 tables)

Introduction
Related Work
Long-Horizon Task Execution and Memory in GUI Agents.
Benchmarks and Datasets for Mobile GUI Agents.
Dataset Construction and Statistics
Long-Horizon Task Formulation
Data Pipeline
Data Statistics
AndroTMem-Bench
Experiment Setup
Benchmark Results and Diagnosis
Anchored State Memory
Motivation from Benchmark Diagnosis
Anchored State Memory Definition
History Utilization Ablation: Does ASM Help?
...and 44 more sections

Figures (8)

Figure 1: Overview of AndroTMem. AndroTMem comprises: (i) AndroTMem-Bench, a long-horizon Android GUI benchmark constructed with intent-driven, cross-app, causally dependent workflows; (ii) representative task cases where sparse intermediate states determine downstream decisions; (iii) a diagnostic evaluation suite showing that performance drops with horizon length are primarily caused by memory failures; and (iv) Anchored State Memory (ASM), which stores causally linked intermediate-state anchors for targeted retrieval and improves long-horizon GUI-agent performance.
Figure 2: Overview of the AndroTMem-Bench dataset construction pipeline. (1) Collect popular mobile apps and group them by function. (2) Generate long-horizon cross-app task instructions with step-to-step causal dependencies using dependency-aware templates. (3) Execute and annotate tasks on Android devices or emulators, producing quality-checked trajectories and dataset outputs.
Figure 3: Overview statistics of AndroTMem-Bench. The first row reports the top app combinations, step length distribution by task type, and overall trajectory length distribution (with comparison to prior benchmarks). The second row shows app usage frequency, overall action-type proportions, and action diversity per task.
Figure 4: Agent performance across different interaction step ranges under three history utilization strategies: (a) Raw History, (b) Coarse Summary, and (c) Anchored State Memory (ASM).
Figure 5: Model performance across different task categories under three history utilization strategies.
...and 3 more figures
