Why Do LLM-based Web Agents Fail? A Hierarchical Planning Perspective

Mohamed Aghzal; Gregory J. Stein; Ziyu Yao

Abstract

Large language model (LLM) web agents are increasingly used for web navigation but remain far from human reliability on realistic, long-horizon tasks. Existing evaluations focus primarily on end-to-end success, offering limited insight into where failures arise. We propose a hierarchical planning framework to analyze web agents across three layers (i.e., high-level planning, low-level execution, and replanning), enabling process-based evaluation of reasoning, grounding, and recovery. Our experiments show that structured Planning Domain Definition Language (PDDL) plans produce more concise and goal-directed strategies than natural language (NL) plans, but low-level execution remains the dominant bottleneck. These results indicate that improving perceptual grounding and adaptive control, not only high-level reasoning, is critical for achieving human-level reliability. This hierarchical perspective provides a principled foundation for diagnosing and advancing LLM web agents.

Paper Structure (51 sections, 14 figures, 11 tables)

Introduction
Preliminaries: Hierarchical Planning-based Web Agents
High-level Planning
Low-level Execution
Postcondition Checking
Replanning
Hierarchical Evaluation of LLMs for Web Agent Applications
Why Evaluating via a Hierarchical Planning Perspective?
High-level Planning
Research Questions
Evaluation Metrics
Low-level Execution
Research Questions
Evaluation Metrics
Replanning
...and 36 more sections

Figures (14)

Figure 1: Overview of our proposed hierarchical planning evaluation framework. The pipeline consists of three stages: 1) High-level Planning: the LLM proposes high-level subgoals; 2) Low-level Execution: each high-level subgoal is translated into a set of low-level actions, and a postcondition checker verifies whether those actions complete the subgoal; and 3) Replanning, which is triggered if the subgoal still fails after multiple iterations. In addition to natural language, we explore a structured representation (PDDL) for high-level planning.
Figure 2: Execution results of different representations
Figure 3: Performance with and without replanning
Figure 4: Examples of the high-level step annotation process. We first prompt gpt-5-nano to produce a high-level step based on the evaluation key-node object, then manually correct issues such as overspecification and steps framed as evaluation functions.
Figure 5: Evaluation trees for high-level alignment.
...and 9 more figures
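
The three-stage pipeline described in the Figure 1 caption can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `run_hierarchical_agent`, the four callables (`plan`, `execute`, `check`, `replan`), and the limits `max_attempts` and `max_replans` are all hypothetical names and values chosen for the example. In a real agent, each callable would wrap an LLM call or a browser interaction.

```python
from typing import Callable, List


def run_hierarchical_agent(
    task: str,
    plan: Callable[[str], List[str]],
    execute: Callable[[str], List[str]],
    check: Callable[[str, List[str]], bool],
    replan: Callable[[str, List[str], str], List[str]],
    max_attempts: int = 3,  # assumed retry budget per subgoal
    max_replans: int = 2,   # assumed replanning budget per task
) -> bool:
    """Return True if every subgoal passes its postcondition check."""
    # Stage 1: high-level planning produces an ordered list of subgoals.
    subgoals = list(plan(task))
    completed: List[str] = []
    replans = 0

    while subgoals:
        subgoal = subgoals.pop(0)

        # Stage 2: low-level execution, verified by a postcondition checker.
        succeeded = False
        for _ in range(max_attempts):
            actions = execute(subgoal)
            if check(subgoal, actions):
                succeeded = True
                break

        if succeeded:
            completed.append(subgoal)
            continue

        # Stage 3: replanning once the subgoal has failed repeatedly.
        if replans >= max_replans:
            return False
        replans += 1
        subgoals = list(replan(task, completed, subgoal))

    return True
```

Keeping the three stages behind separate callables mirrors the layered evaluation idea: each layer can be swapped out or measured independently, for example by replacing `plan` with a fixed reference plan to isolate low-level execution.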
