WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance

Genglin Liu, Shijie Geng, Sha Li, Hejie Cui, Sarah Zhang, Xin Liu, Tianyi Liu

TL;DR

WebCoach addresses the lack of long-term memory in multimodal web agents by introducing a memory-centered framework that persists experiences across sessions. The system comprises a WebCondenser for trajectory summarization, an External Memory Store (EMS) for episodic experiences, and a Coach that retrieves relevant memories and selectively injects guidance into the agent in real time. Experiments on the WebVoyager benchmark show consistent performance gains across multiple base models, with self-generated memories offering the strongest transfer and larger models benefiting the most. The approach enables self-evolving, memory-guided web agents that improve robustness and efficiency without retraining, highlighting memory as a critical driver for real-world web navigation.

Abstract

Multimodal LLM-powered agents have recently demonstrated impressive capabilities in web navigation, enabling agents to complete complex browsing tasks across diverse domains. However, current agents struggle with repetitive errors and lack the ability to learn from past experiences across sessions, limiting their long-term robustness and sample efficiency. We introduce WebCoach, a model-agnostic self-evolving framework that equips web browsing agents with persistent cross-session memory, enabling improved long-term planning, reflection, and continual learning without retraining. WebCoach consists of three key components: (1) a WebCondenser, which standardizes raw navigation logs into concise summaries; (2) an External Memory Store, which organizes complete trajectories as episodic experiences; and (3) a Coach, which retrieves relevant experiences based on similarity and recency, and decides whether to inject task-specific advice into the agent via runtime hooks. This design empowers web agents to access long-term memory beyond their native context window, improving robustness in complex browsing tasks. Moreover, WebCoach achieves self-evolution by continuously curating episodic memory from new navigation trajectories, enabling agents to improve over time without retraining. Evaluations on the WebVoyager benchmark demonstrate that WebCoach consistently improves the performance of browser-use agents across three different LLM backbones. With a 38B model, it increases task success rates from 47% to 61% while reducing or maintaining the average number of steps. Notably, smaller base models with WebCoach achieve performance comparable to the same web agent using GPT-4o.
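The abstract says the Coach retrieves experiences "based on similarity and recency" and then decides whether to inject advice, but it does not give the scoring rule. The following is a minimal sketch of one way such a retrieval step could look, not the authors' implementation. The `Episode` fields, the exponential age discount, the `decay` and `threshold` values, and the function names are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    summary: str            # condensed trajectory summary (hypothetical field)
    embedding: List[float]  # vector for the summary; the embedding model is out of scope
    age: int                # sessions elapsed since the episode was stored


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(query: List[float], ep: Episode, decay: float = 0.1) -> float:
    # Assumed combination: similarity discounted exponentially by age.
    return cosine(query, ep.embedding) * math.exp(-decay * ep.age)


def retrieve(query: List[float], store: List[Episode],
             k: int = 3, decay: float = 0.1) -> List[Episode]:
    """Return the k highest-scoring episodes, best first."""
    ranked = sorted(store, key=lambda ep: score(query, ep, decay), reverse=True)
    return ranked[:k]


def advise(query: List[float], store: List[Episode],
           threshold: float = 0.5, k: int = 3) -> Optional[str]:
    """Build advice text, or return None when nothing is relevant enough."""
    hits = [ep for ep in retrieve(query, store, k) if score(query, ep) >= threshold]
    if not hits:
        return None  # stay silent rather than inject weak guidance
    return "Relevant past experience:\n" + "\n".join("- " + ep.summary for ep in hits)
```

Returning `None` below the threshold mirrors the "selectively injects" behavior described above: the agent only sees guidance when a stored episode is both similar and recent enough under the assumed score.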

Paper Structure

This paper contains 31 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Overview of the WebCoach framework. WebCoach augments web-browsing agents with persistent, cross-session memory through an External Memory Store (EMS) and a retrieval-augmented coaching mechanism. The Condenser converts raw navigation histories into standardized summaries stored in EMS, from which the Coach retrieves relevant prior experiences to provide task-specific guidance to the main web agent. This design enables long-term planning, reflection, and continual improvement across browsing sessions.
  • Figure 2: Retrieval speed at k for an EMS holding 600 trajectories. Each retrieval is repeated 200 times to measure consistency. Most runs average between 9.0 and 9.5 ms for k ranging from 1 to 10.
  • Figure 3: Asynchronous Evaluation of WebVoyager. WebCoach's asynchronous evaluation pipeline distributes the 15 subdomains in WebVoyager (e.g., Amazon, Apple, ArXiv) across parallel evaluation queues to maximize throughput and GPU utilization. Yellow boxes indicate in-progress tasks, green boxes indicate completed tasks, and blue boxes indicate tasks waiting in the queue. Our limited compute supports running 5 tasks in parallel; when a task finishes earlier than the others in a batch, we immediately start another task from the wait list instead of waiting for the entire batch to finish. This asynchronous queueing strategy reduces total evaluation time by over 80%, enabling scalable benchmarking of web agents.
  • Figure 4: Performance comparison across base models. WebCoach consistently improves browser-use agents’ reasoning and robustness across different backbones. The framework achieves higher success rates with equal or fewer average steps, while maintaining efficient completion times.
  • Figure 5: Step 1 Screenshot
  • ...and 3 more figures
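
The queueing strategy in the Figure 3 caption (at most 5 tasks in flight, with a waiting task started as soon as any running task finishes) can be sketched with a semaphore. This is an illustrative sketch only, not the paper's pipeline: `run_all`, `FakeRunner`, and the task identifiers are hypothetical, and the fake runner merely sleeps where a real browsing episode would run.

```python
import asyncio


async def run_all(task_ids, run_task, max_parallel=5):
    """Run every task while keeping at most max_parallel in flight."""
    slots = asyncio.Semaphore(max_parallel)

    async def guarded(task_id):
        async with slots:  # a waiting task starts as soon as any slot frees up
            return await run_task(task_id)

    # gather returns results in the same order as task_ids
    return await asyncio.gather(*(guarded(t) for t in task_ids))


class FakeRunner:
    """Stand-in for a real browsing episode; records peak concurrency."""

    def __init__(self):
        self.active = 0
        self.peak = 0

    async def __call__(self, task_id):
        self.active += 1
        self.peak = max(self.peak, self.active)
        # Uneven durations, so some tasks finish before others in their group
        await asyncio.sleep(0.001 * (1 + len(task_id) % 3))
        self.active -= 1
        return (task_id, "done")
```

A call such as `asyncio.run(run_all(task_ids, FakeRunner(), max_parallel=5))` would run the whole list. Compared with fixed batches of 5, no slot sits idle while the slowest task in a batch finishes, which is the behavior the caption describes.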