Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

Mohamed Aghzal, Gregory J. Stein, Ziyu Yao

Abstract

Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.

Paper Structure (51 sections, 14 figures, 11 tables)

Figures (14)

  • Figure 1: Overview of the proposed hierarchical planning evaluation framework. The pipeline consists of three stages: 1) High-level Planning: the LLM proposes high-level subgoals; 2) Low-level Execution: each high-level subgoal is translated into a set of low-level actions, and a postcondition checker verifies whether those actions complete the subgoal; 3) Replanning: triggered if the subgoal still fails after multiple iterations. In addition to natural language, we explore a structured representation (PDDL) for high-level planning.
  • Figure 2: Execution results of different representations
  • Figure 3: Performance with and without replanning
  • Figure 4: Examples of the high-level step annotation process. We begin by prompting gpt-5-nano to produce a high-level step based on the evaluation key-node object, then manually correct issues such as overspecification and steps framed as evaluation functions.
  • Figure 5: Evaluation trees for high-level alignment.
  • ...and 9 more figures
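
The three-stage loop described in the Figure 1 caption (plan, execute with a postcondition check, replan on repeated failure) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`Subgoal`, `run_episode`, `max_attempts`, `max_replans`, and the toy executor and replanner) is hypothetical, and the environment state is reduced to a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical sketch of the plan / execute / replan loop from Figure 1.
# None of these names come from the paper.


@dataclass
class Subgoal:
    description: str
    # Returns True once the (toy) environment state satisfies the subgoal.
    postcondition: Callable[[dict], bool]


def run_episode(
    plan: List[Subgoal],
    execute: Callable[[Subgoal, dict], dict],
    replan: Callable[[Subgoal, dict], List[Subgoal]],
    state: dict,
    max_attempts: int = 3,
    max_replans: int = 2,
) -> Tuple[bool, dict, List[Tuple[str, str]]]:
    """Run subgoals in order; replan when one keeps failing its postcondition."""
    log: List[Tuple[str, str]] = []
    replans = 0
    queue = list(plan)
    while queue:
        subgoal = queue.pop(0)
        done = False
        # Low-level execution: retry the subgoal up to max_attempts times.
        for _ in range(max_attempts):
            state = execute(subgoal, state)
            if subgoal.postcondition(state):
                done = True
                break
        if done:
            log.append(("ok", subgoal.description))
            continue
        log.append(("failed", subgoal.description))
        # Replanning: replace the remaining plan, within a fixed budget.
        if replans >= max_replans:
            return False, state, log
        replans += 1
        queue = list(replan(subgoal, state))
    return True, state, log


def cart_is_open(state: dict) -> bool:
    return bool(state.get("cart_open", False))


def toy_execute(subgoal: Subgoal, state: dict) -> dict:
    # Toy executor: only the menu route works, mimicking a grounding failure
    # on the first subgoal.
    new_state = dict(state)
    if subgoal.description == "open cart via menu":
        new_state["cart_open"] = True
    return new_state


def toy_replan(failed: Subgoal, state: dict) -> List[Subgoal]:
    # Toy replanner: always proposes the alternative route.
    return [Subgoal("open cart via menu", cart_is_open)]


success, final_state, log = run_episode(
    plan=[Subgoal("click cart icon", cart_is_open)],
    execute=toy_execute,
    replan=toy_replan,
    state={},
)
print(success, log)
```

Because each subgoal's outcome is logged separately from the end-to-end result, a trace like this makes it possible to say whether a failed episode broke down at planning, execution, or recovery, which is the kind of process-based evaluation the abstract argues for.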