MGA: Memory-Driven GUI Agent for Observation-Centric Interaction

Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, Ding Wang

TL;DR

The paper tackles GUI agents’ susceptibility to long-horizon state drift and local exploration bias by introducing MGA, a memory-driven framework built on an observe-first paradigm. It formalizes a step-wise environment state $E_t=(I_t, Z_t, S_{t-1})$ and four modules (Observer, Abstract Memory Agent, Planner, and Grounding) that work with a dynamic memory triad to decouple decisions from historical trajectories. Empirical results on OSWorld and real desktop applications show that MGA outperforms strong baselines, with notable improvements on long-horizon tasks and in cross-task transfer; ablations confirm that both memory and task-agnostic grounding contribute significantly. The work suggests that combining MGA’s memory-driven cognition with selective code-level execution could yield further efficiency gains in practical GUI automation scenarios.

Abstract

The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: dependence on historical trajectories, which amplifies error propagation, and local exploration bias, where "decision-first, observation-later" mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on OSWorld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at https://anonymous.4open.science/r/MGA-3571
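
To make the state triad concrete, here is a minimal Python sketch of one observe-first step. It is not taken from the MGA code release. The class `EnvState`, the function `run_step`, the three toy callables, and the representation of memory as a tuple of strings are all hypothetical names and choices made for illustration. The mapping of $I_t$, $Z_t$, and $S_{t-1}$ onto screenshot, spatial information, and memory assumes the symbols follow the order in which the abstract lists the triad.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Memory = Tuple[str, ...]


@dataclass(frozen=True)
class EnvState:
    """Hypothetical container for E_t = (I_t, Z_t, S_{t-1})."""
    screenshot: bytes               # I_t: current screenshot (assumed mapping)
    spatial_info: Tuple[str, ...]   # Z_t: task-agnostic spatial information
    memory: Memory                  # S_{t-1}: structured memory from the previous step


def run_step(
    state: EnvState,
    observe: Callable[[EnvState], str],
    plan: Callable[[str, Memory], str],
    update_memory: Callable[[Memory, str, str], Memory],
) -> Tuple[str, Memory]:
    """Run one step: observe first, then decide, then refresh the memory."""
    observation = observe(state)                 # look at the interface first
    action = plan(observation, state.memory)     # decide from observation + memory
    next_memory = update_memory(state.memory, observation, action)
    return action, next_memory


# Toy stand-ins for the modules; a real agent would call models here.
def toy_observe(state: EnvState) -> str:
    return "visible: " + ", ".join(state.spatial_info)


def toy_plan(observation: str, memory: Memory) -> str:
    return "click Save" if "Save" in observation else "wait"


def toy_update_memory(memory: Memory, observation: str, action: str) -> Memory:
    return memory + (f"{action} after seeing [{observation}]",)


state = EnvState(screenshot=b"", spatial_info=("File", "Save"), memory=())
action, memory = run_step(state, toy_observe, toy_plan, toy_update_memory)
print(action)
print(memory)
```

In this sketch the planner receives only the current observation and the memory, never the raw history of earlier screenshots and actions, which is one way to read the abstract's claim that each step is an independent state. The next step would build a fresh `EnvState` from a new screenshot, new spatial information, and the returned memory.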

Paper Structure

This paper contains 21 sections, 3 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Detailed workflow of MGA showing internal data flow among the Observer, Memory Agent, Planner, and Grounding pipeline. The structured memory enables consistent reasoning across steps.