M$^2$: Dual-Memory Augmentation for Long-Horizon Web Agents via Trajectory Summarization and Insight Retrieval

Dawei Yan, Haokui Zhang, Guangda Huzhang, Yang Li, Yibo Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Ying Li, Wei Dong, Chunhua Shen

TL;DR

This work proposes M$^2$, a training-free, memory-augmented framework that improves context efficiency and decision-making robustness in MLLM-based web agents. It combines two memory tiers: Dynamic Trajectory Summarization (Internal Memory), which compresses interaction history into concise state updates, and Insight Retrieval Augmentation (External Memory), which guides the agent with actionable guidelines retrieved from an offline insight bank.

Abstract

Agents based on Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in autonomous web navigation. However, handling long-horizon tasks remains a critical bottleneck. Prevailing strategies rely heavily on extensive data collection and model training, yet still suffer from high computational costs and insufficient reasoning capabilities in complex, long-horizon scenarios. To address this, we propose M$^2$, a training-free, memory-augmented framework designed to optimize context efficiency and decision-making robustness. Our approach incorporates a dual-tier memory mechanism that combines Dynamic Trajectory Summarization (Internal Memory), which compresses verbose interaction history into concise state updates, with Insight Retrieval Augmentation (External Memory), which guides the agent with actionable guidelines retrieved from an offline insight bank. Extensive evaluations on WebVoyager and OnlineMind2Web demonstrate that M$^2$ consistently surpasses baselines, yielding up to a 19.6% success rate increase and a 58.7% token reduction for Qwen3-VL-32B, while proprietary models like Claude achieve accuracy gains of up to 12.5% alongside significantly lower computational overhead.
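To make the Internal Memory idea concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a full-context prompt, which replays every past observation, with a prompt built from a running summary plus the current observation. The names `Step`, `InternalMemory`, `full_context`, and `toy_summarizer` are hypothetical, and `toy_summarizer` is a plain string function standing in for the model's self-summarization call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    observation: str  # textual stand-in for what the agent saw at this step
    action: str       # the action the agent took


def full_context(task: str, steps: List[Step]) -> str:
    """Baseline: concatenate every past observation and action verbatim."""
    lines = [f"Task: {task}"]
    for i, step in enumerate(steps, 1):
        lines.append(f"Step {i} observation: {step.observation}")
        lines.append(f"Step {i} action: {step.action}")
    return "\n".join(lines)


@dataclass
class InternalMemory:
    """Keeps one running summary instead of the raw interaction history."""
    summarize: Callable[[str, Step], str]  # assumed: an LLM call in practice
    summary: str = ""

    def update(self, step: Step) -> None:
        # Fold the newest step into the summary, then discard the raw step.
        self.summary = self.summarize(self.summary, step)

    def context(self, task: str, current_observation: str) -> str:
        # The prompt carries only the task, the summary, and the current page.
        return "\n".join([
            f"Task: {task}",
            f"Progress so far: {self.summary or 'none'}",
            f"Current observation: {current_observation}",
        ])


def toy_summarizer(previous: str, step: Step) -> str:
    """Stand-in summarizer (assumption): keep only the actions taken."""
    return f"{previous}; {step.action}" if previous else step.action
```

In this sketch the prompt size depends on the summary length rather than on the number of past observations, which is the property the abstract attributes to trajectory summarization.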

Paper Structure (40 sections, 9 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Comparison with SOTA methods on WebVoyager dataset. $\star$: Official metrics from the original release; note that some tasks within this benchmark are now infeasible due to website updates.
  • Figure 2: Overview of the proposed framework. (a) The baseline agent operates with raw context containing redundant visual history and verbose interaction text. This creates high computational overhead and introduces noise that may impair decision-making. Our method M$^2$ incorporates two key mechanisms: In-Mem and Ex-Mem. (b) The In-Mem module prompts the agent to self-summarize historical steps into a concise context. Simultaneously, (c) the Ex-Mem module retrieves actionable insights from historically successful trajectories based on query similarity. (d) This dual-memory approach provides explicit guidance for the current execution step, significantly reducing errors and yielding tangible performance gains.
  • Figure 3: Comparison between Full Context and In-Mem interaction paradigms. The Full Context (left) retains the entire sequence of raw screenshots and exhaustive historical text, leading to high redundancy. In contrast, the In-Mem approach (right) employs a specific prompt to synthesize historical steps into a concise textual summary.
  • Figure 4: Overview of the Insight Retrieval Augmentation pipeline. The workflow begins with a Trajectory Insight Database populated by successful trajectories from diverse models. For a given user query, the system filters insights based on semantic similarity, retrieving the Top-$i$ relevant insights. These insights are then injected into the System Prompt. This mechanism ensures the Web Agent generates goal-centric and error-proof actions, effectively navigating complex UI environments by leveraging cross-trajectory experiences.
  • Figure 5: Distribution of the 55k trajectory data across 12 distinct web domains in the Insight Bank.
  • ...and 8 more figures
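
The retrieval flow described in the Figure 4 caption (score stored insights against the user query, keep the top $i$, inject them into the system prompt) can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: bag-of-words cosine similarity stands in for whatever semantic similarity measure the authors use, and `INSIGHT_BANK`, `retrieve_insights`, `build_system_prompt`, and the `min_sim` threshold are hypothetical names and values.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

# Hypothetical insight bank: (task description, insight distilled from a success).
INSIGHT_BANK: List[Tuple[str, str]] = [
    ("find the cheapest flight from Boston to Denver",
     "Sort results by price before reading the list."),
    ("book a hotel room in Paris for two nights",
     "Set the check-in and check-out dates before searching."),
    ("find a vegetarian lasagna recipe",
     "Use the site's dietary filter instead of scrolling."),
]


def bow(text: str) -> Counter:
    """Lowercased bag of words; a stand-in for a semantic embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_insights(query: str, bank: List[Tuple[str, str]],
                      top_i: int = 2, min_sim: float = 0.2) -> List[str]:
    """Return up to top_i insights whose source task resembles the query."""
    q = bow(query)
    scored = [(cosine(q, bow(task)), insight) for task, insight in bank]
    scored = [pair for pair in scored if pair[0] >= min_sim]  # drop weak matches
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [insight for _, insight in scored[:top_i]]


def build_system_prompt(base: str, insights: List[str]) -> str:
    """Append retrieved insights to the base system prompt, if any exist."""
    if not insights:
        return base
    bullets = "\n".join(f"- {text}" for text in insights)
    return f"{base}\n\nInsights from similar past tasks:\n{bullets}"
```

For a query such as "find the cheapest flight from Seattle to Austin", only the flight entry clears the threshold in this toy bank, so a single insight is injected. A real system would likely swap `bow` and `cosine` for an embedding model and tune the threshold and $i$.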