Enhancing Web Agents with a Hierarchical Memory Tree

Yunteng Tan, Zhi Gao, Xinxiao Wu

TL;DR

Experimental results on Mind2Web and WebArena show that HMT significantly outperforms flat-memory methods, particularly in cross-website and cross-domain scenarios, highlighting the necessity of structured memory for robust generalization of web agents.

Abstract

Large language model-based web agents have shown strong potential in automating web interactions through advanced reasoning and instruction following. While retrieval-based memory derived from historical trajectories enables these agents to handle complex, long-horizon tasks, current methods struggle to generalize to unseen websites. We identify that this challenge arises from flat memory structures that entangle high-level task logic with site-specific action details. This entanglement induces a workflow mismatch in new environments, where retrieved contents are conflated with the current web page, leading to logically inconsistent execution. To address this, we propose the Hierarchical Memory Tree (HMT), a structured framework designed to explicitly decouple logical planning from action execution. HMT constructs a three-level hierarchy from raw trajectories via an automated abstraction pipeline: the Intent level maps diverse user instructions to standardized task goals; the Stage level defines reusable semantic subgoals characterized by observable pre-conditions and post-conditions; and the Action level stores action patterns paired with transferable semantic element descriptions. Leveraging this structure, we develop a stage-aware inference mechanism comprising a Planner and an Actor. By explicitly validating pre-conditions, the Planner aligns the current state with the correct logical subgoal to prevent workflow mismatch, while the Actor grounds actions by matching the stored semantic descriptions to the target page. Experimental results on Mind2Web and WebArena show that HMT significantly outperforms flat-memory methods, particularly in cross-website and cross-domain scenarios, highlighting the necessity of structured memory for robust generalization of web agents.
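
The three-level hierarchy described in the abstract can be pictured as a small tree of records. The following is a minimal sketch, not the authors' implementation: the class names (`IntentNode`, `StageNode`, `ActionNode`), their field names, and the example values other than the "Flight list visible" pre-condition are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ActionNode:
    """Action level: an action pattern plus a transferable element description."""
    action_type: str          # e.g. "click" or "type" (hypothetical vocabulary)
    element_description: str  # semantic description, not a site-specific selector


@dataclass
class StageNode:
    """Stage level: a reusable subgoal with observable pre/post-conditions."""
    subgoal: str
    pre_conditions: list[str]
    post_conditions: list[str]
    actions: list[ActionNode] = field(default_factory=list)


@dataclass
class IntentNode:
    """Intent level: a standardized task goal shared by many user instructions."""
    goal: str
    stages: list[StageNode] = field(default_factory=list)


def build_example_tree() -> IntentNode:
    # Illustrative content only; the goal, subgoal, and post-condition are made up.
    select_flight = StageNode(
        subgoal="Select a flight",
        pre_conditions=["Flight list visible"],
        post_conditions=["Flight details shown"],
        actions=[ActionNode("click", "button that selects a flight in the results list")],
    )
    return IntentNode(goal="Book a flight", stages=[select_flight])


tree = build_example_tree()
```

Keeping the element description separate from any raw identifier is the point of the Action level: nothing in the tree refers to a selector that exists only on the source website.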

Paper Structure

This paper contains 26 sections, 2 equations, 6 figures, 5 tables, 2 algorithms.

Figures (6)

  • Figure 1: Comparison between Flat Memory and Hierarchical Memory Tree (HMT). (a) Flat memory methods retrieve interaction trajectories that mix the original workflows with source-specific implementation details and contain no subgoals, leading to workflow mismatch and context pollution when applied to unseen websites. (b) HMT decouples intent from execution using a tree structure. It retrieves stage-aligned subgoals and abstract element descriptions, enabling the agent to plan the correct step and ground actions on the target interface effectively.
  • Figure 2: Overview of HMT. The framework consists of a construction pipeline that abstracts raw trajectories into a hierarchical memory tree, and a stage-aware inference mechanism where a Planner selects the logical stage and an Actor grounds the action level description to the target page.
  • Figure 3: Structure of the Hierarchical Memory Tree. Unlike flat lists, HMT organizes memory into intent, stage, and action levels.
  • Figure 4: Mechanism Analysis. (a) Retrieval recall comparisons show that HMT provides more accurate context. (b) Grounding success rate across generalization splits shows that raw identifiers fail in cross-website and cross-domain settings, while the semantic descriptions (ours) remain robust.
  • Figure 5: Visual analysis of a successful cross-website grounding trace demonstrated by HMT. (a) Memory Retrieval: The agent retrieves an abstract action pattern and a semantic descriptor from the hierarchical memory, explicitly discarding the site-specific raw identifier (#btn-sfo-136) from the source trace. (b) Planner Verification & Decision: To prevent workflow mismatch, the Planner verifies that the current page satisfies the stage pre-conditions (e.g., "Flight list visible") before authorizing the Actor, ensuring actions are only executed in the correct context. (c) Actor Grounding & Execution: Guided by the semantic descriptor, the Actor scans the target DOM on Trip.com. It successfully distinguishes between a distractor advertisement and the correct flight selection button, executing the correct action despite the layout shift.
  • ...and 1 more figure
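
The Planner and Actor roles described in the captions for Figures 2 and 5 can be sketched as two small functions. This is a hedged illustration under strong assumptions, not the paper's method: the function names, the dictionary layout, the element ids, and the token-overlap scoring are all hypothetical stand-ins for components that the paper builds on a language model.

```python
import re


def tokens(text):
    """Lowercase word tokens, used for a crude similarity score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_stage(stages, observed_facts):
    """Planner sketch: return the first stage whose pre-conditions all hold."""
    for stage in stages:
        if all(cond in observed_facts for cond in stage["pre_conditions"]):
            return stage
    return None  # no stage matches, so no action is authorized


def ground_action(description, candidates):
    """Actor sketch: pick the page element most similar to the description."""
    desc = tokens(description)

    def score(candidate):
        cand = tokens(candidate["text"])
        union = desc | cand
        return len(desc & cand) / len(union) if union else 0.0

    return max(candidates, key=score)


# Hypothetical memory content and target page, loosely following Figure 5.
stages = [
    {"subgoal": "Enter search criteria",
     "pre_conditions": ["Search form visible"],
     "description": "input field for the departure city"},
    {"subgoal": "Select a flight",
     "pre_conditions": ["Flight list visible"],
     "description": "button to select a flight from the results list"},
]
observed_facts = {"Flight list visible"}
candidates = [
    {"id": "ad-banner", "text": "Book hotel deals now"},  # distractor advertisement
    {"id": "select-0", "text": "Select flight"},
]

stage = select_stage(stages, observed_facts)
target = ground_action(stage["description"], candidates)
```

Here the pre-condition check rejects the first stage and authorizes "Select a flight", and the description matches the flight selection button rather than the advertisement. The stored description carries no site-specific identifier, so the same memory entry can in principle be grounded on a page with a different layout.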