WALT: Web Agents that Learn Tools
Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, Ran Xu
TL;DR
WALT addresses brittleness in browser automation by learning tools that map to website-provided functionality, rather than relying on fragile UI interaction sequences. It introduces a two-stage pipeline that first discovers site-specific tools and then constructs, validates, and exposes them as deterministic, callable actions, reducing reliance on LLM reasoning. Empirical results on VisualWebArena and WebArena demonstrate state-of-the-art success rates and up to 1.4x efficiency gains, with ablations confirming the value of discovered tools, multimodal DOM parsing, and external verification. The approach offers a scalable, auditable paradigm for web automation that generalizes across domains and websites, enabling more reliable and efficient agent behavior.
Abstract
Web agents promise to automate complex browser tasks, but current methods remain brittle: they rely on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable, invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites, spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
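
The contrast between step-by-step UI interaction and tool invocation can be illustrated with a minimal sketch. This is not WALT's implementation: `RecordingBrowser`, `ToolRegistry`, and the CSS selectors are hypothetical names invented for this example, and the "browser" only logs actions instead of driving a real page. The sketch assumes a tool is a fixed script of low-level actions registered under a name, so the agent supplies only the arguments.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class RecordingBrowser:
    """Hypothetical stand-in for a browser driver; it only logs actions."""
    log: List[Tuple[str, ...]] = field(default_factory=list)

    def click(self, selector: str) -> None:
        self.log.append(("click", selector))

    def type_text(self, selector: str, text: str) -> None:
        self.log.append(("type", selector, text))


@dataclass
class ToolRegistry:
    """Hypothetical registry mapping tool names to deterministic scripts."""
    browser: RecordingBrowser
    tools: Dict[str, Callable[..., None]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[..., None]) -> None:
        self.tools[name] = fn

    def call(self, name: str, **kwargs: str) -> None:
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        # The low-level steps are fixed inside the tool; the caller
        # (the agent) only chooses the tool and its arguments.
        self.tools[name](self.browser, **kwargs)


def search(browser: RecordingBrowser, query: str) -> None:
    """Assumed UI script for a site's search box (selectors are made up)."""
    browser.type_text("#search-input", query)
    browser.click("#search-submit")


registry = ToolRegistry(RecordingBrowser())
registry.register("search", search)

# One tool call replaces two separately reasoned UI actions.
registry.call("search", query="red bicycle")
```

In this sketch the agent makes a single decision, `search(query="red bicycle")`, while the type and click steps run deterministically inside the tool. That is the shift the abstract describes, shown here only under the stated assumptions.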
