WebSuite: Systematically Evaluating Why Web Agents Fail
Eric Li, Jim Waldo
TL;DR
WebSuite introduces a diagnostic benchmark and a structured taxonomy of web actions to identify why generalist web agents fail, not merely whether they succeed. By combining low-level action tasks with end-to-end scenarios in a logging-enabled environment, it enables per-action failure attribution and targeted improvement. The authors demonstrate the approach on two agents (natbot and SeeAct), revealing a distinct weakness pattern for each, for example in form filling and link targeting, and argue for broader, real-world evaluation and for automating failure analysis. The work provides a foundation for more granular, actionable benchmarking that can accelerate the development of robust web agents, and it suggests concrete future directions for taxonomy expansion and automation.
Abstract
We describe WebSuite, the first diagnostic benchmark for generalist web agents, designed to systematically evaluate why agents fail. Advances in AI have led to the rise of numerous web agents that autonomously operate a browser to complete tasks. However, most existing benchmarks focus strictly on measuring whether an agent can complete a task, without giving insight into why it fails. In this paper, we 1) develop a taxonomy of web actions to facilitate identifying common failure patterns, and 2) create an extensible benchmark suite to assess agents' performance on our taxonomized actions. This benchmark suite consists of both individual tasks, such as clicking a button, and end-to-end tasks, such as adding an item to a cart, and is designed such that any task failure can be attributed directly to the failure of a specific web action. We evaluate two popular generalist web agents, one text-based and one multimodal, and identify unique weaknesses for each agent. Because WebSuite disaggregates task failures into specific action failures, it enables granular identification of which UX flows an individual agent has trouble with and immediately highlights promising avenues for improvement. These findings highlight the need for benchmarking that focuses on where web agents go wrong, so that agents can be improved beyond their currently weak performance.
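To make the idea of per-action failure attribution concrete, the following is a minimal sketch, not the WebSuite implementation. It assumes each task run is logged as an ordered list of steps, each tagged with a taxonomy action and a success flag. The names `ActionResult`, `first_failed_action`, and `failure_counts`, the action labels, and the sample logs are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionResult:
    """One logged step of a task run (hypothetical log format)."""
    action: str       # assumed taxonomy label, e.g. "click" or "fill_form"
    succeeded: bool   # whether the agent performed this step correctly


def first_failed_action(steps):
    """Return the action label of the first failed step, or None if all steps succeeded."""
    for step in steps:
        if not step.succeeded:
            return step.action
    return None


def failure_counts(task_logs):
    """Count, per action label, how many tasks failed at that action."""
    counts = Counter()
    for steps in task_logs.values():
        failed = first_failed_action(steps)
        if failed is not None:
            counts[failed] += 1
    return counts


# Illustrative logs only; these are not results from the paper.
logs = {
    "add_item_to_cart": [
        ActionResult("click", True),
        ActionResult("fill_form", False),
    ],
    "open_product_page": [
        ActionResult("click_link", True),
    ],
    "submit_contact_form": [
        ActionResult("fill_form", False),
    ],
}

print(failure_counts(logs))
```

Attributing each failed task to its first failed step is one simple convention. It reflects the abstract's requirement that a task failure map to a single specific action, but the actual attribution rule used by the benchmark may differ.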
