From Grounding to Planning: Benchmarking Bottlenecks in Web Agents

Segev Shlomov, Ben Wiesel, Aviad Sela, Ido Levy, Liane Galanti, Roy Abitbol

TL;DR

This work proposes a separate benchmark for each of the two components of a web agent, planning and grounding, to identify the bottlenecks and pain points that limit agent performance. The findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques.

Abstract

General web-based agents are increasingly essential for interacting with complex web environments, yet their performance in real-world web applications remains poor, yielding extremely low accuracy even with state-of-the-art frontier models. We observe that these agents can be decomposed into two primary components: Planning and Grounding. Yet most existing research treats these agents as black boxes and focuses on end-to-end evaluations, which hinders meaningful improvements. We sharpen the distinction between the planning and grounding components and conduct a novel analysis by refining experiments on the Mind2Web dataset. Our work proposes a new benchmark for each of the components separately, identifying the bottlenecks and pain points that limit agent performance. Contrary to prevalent assumptions, our findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques. Instead, the primary challenge lies in the planning component, which is the main source of performance degradation. Through this analysis, we offer new insights and demonstrate practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.

Paper Structure

This paper contains 42 sections, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Grounding Pipeline: The website's DOM, screenshot, and low-level instruction are processed through the PU and element ranking phases to generate an annotated screen capture with SoM and a prompt. These are then passed together with the low-level instruction to the VLM to select the web element.
  • Figure 2: WebNaviX Architecture: The website's DOM, screenshot, and high-level instruction are processed through the PU and element ranking phases, the latter equipped with an improved ranking, to generate an annotated screen capture with SoM and a prompt. These are then passed together with the high-level instruction and the history of actions to the VLM to select the web element.
  • Figure 3: Planner performance across task flow steps.
  • Figure 4: Spatial distribution of the ground truth element relative to the page layout, within the training dataset. Lower Y values indicate proximity to the top of the page and lower X values indicate proximity to the left side of the page. Colors indicate the length of the text.
  • Figure 5: Frequency histogram of the number of words in the ground truth text on the Mind2Web training dataset.
  • ...and 5 more figures
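
The data flow described in the Figure 1 caption (rank candidate elements, attach marks, build a prompt for the VLM) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `Element` type, the word-overlap ranking that stands in for the element ranking phase, the numeric marks that stand in for the SoM annotation of the screenshot, and the prompt wording. The PU phase and the VLM call are omitted.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """A candidate DOM element (hypothetical minimal representation)."""
    node_id: str
    text: str


def rank_elements(elements, instruction, top_k=3):
    """Shortlist elements by word overlap with the instruction.

    Assumption: plain token overlap is a stand-in for the ranking phase.
    """
    words = set(instruction.lower().split())

    def score(el):
        return len(words & set(el.text.lower().split()))

    # sorted() is stable, so tied elements keep their original order.
    return sorted(elements, key=score, reverse=True)[:top_k]


def assign_marks(ranked):
    """Give each shortlisted element a numeric mark (1, 2, ...)."""
    return {mark: el for mark, el in enumerate(ranked, start=1)}


def build_prompt(instruction, marks):
    """Compose the text prompt that would accompany the annotated screenshot."""
    lines = [f"Instruction: {instruction}", "Candidates:"]
    lines += [f"[{mark}] {el.text}" for mark, el in marks.items()]
    lines.append("Answer with the mark of the element to act on.")
    return "\n".join(lines)


elements = [
    Element("n1", "Sign in"),
    Element("n2", "Search flights"),
    Element("n3", "Contact us"),
]
instruction = "Click the search flights button"
marks = assign_marks(rank_elements(elements, instruction, top_k=2))
prompt = build_prompt(instruction, marks)
print(prompt)
```

In this sketch the model's answer would be a mark number, which `marks` maps back to a concrete element. Keeping that mapping outside the model is what lets element selection be scored separately from the choice of what to do next.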