WALT: Web Agents that Learn Tools

Viraj Prabhu, Yutong Dai, Matthew Fernandez, Jing Gu, Krithika Ramakrishnan, Yanqi Luo, Silvio Savarese, Caiming Xiong, Junnan Li, Zeyuan Chen, Ran Xu

TL;DR

WALT addresses brittleness in browser automation by learning tools that map to website-provided functionality instead of relying on fragile UI interaction sequences. It introduces a two-stage pipeline that first discovers site-specific tools and then constructs, validates, and exposes them as deterministic, callable actions, reducing reliance on LLM reasoning. Empirical results on VisualWebArena and WebArena demonstrate state-of-the-art success rates and up to 1.4x efficiency gains, with ablations confirming the value of discovered tools, multimodal DOM parsing, and external verification. The approach offers a scalable, auditable paradigm for web automation that generalizes across domains and websites, enabling more reliable and efficient agent behavior.

Abstract

Web agents promise to automate complex browser tasks, but current methods remain brittle -- relying on step-by-step UI interactions and heavy LLM reasoning that break under dynamic layouts and long horizons. Humans, by contrast, exploit website-provided functionality through high-level operations like search, filter, and sort. We introduce WALT (Web Agents that Learn Tools), a framework that reverse-engineers latent website functionality into reusable invocable tools. Rather than hypothesizing ad-hoc skills, WALT exposes robust implementations of automations already designed into websites -- spanning discovery (search, filter, sort), communication (post, comment, upvote), and content management (create, edit, delete). Tools abstract away low-level execution: instead of reasoning about how to click and type, agents simply call search(query) or create(listing). This shifts the computational burden from fragile step-by-step reasoning to reliable tool invocation. On VisualWebArena and WebArena, WALT achieves higher success with fewer steps and less LLM-dependent reasoning, establishing a robust and generalizable paradigm for browser automation.
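
To make the idea of "call search(query) instead of clicking and typing" concrete, here is a minimal sketch of what a single-call tool could look like from the agent's side. It is not WALT's implementation: the `Tool` class, the `LISTINGS` data, and the `_search` helper are hypothetical stand-ins, and the "website" is just an in-memory list.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Hypothetical in-memory stand-in for a classifieds site (not real data).
LISTINGS = [
    {"title": "Blue kayak", "price": 250},
    {"title": "Red canoe", "price": 180},
    {"title": "Blue kayak with paddle", "price": 190},
]


@dataclass
class Tool:
    """A callable action with a minimal input check (illustrative only)."""
    name: str
    required: List[str]
    run: Callable[..., Any]

    def __call__(self, **kwargs: Any) -> Any:
        # Reject calls that omit a required argument before doing any work.
        missing = [k for k in self.required if k not in kwargs]
        if missing:
            raise ValueError(f"{self.name}: missing arguments {missing}")
        return self.run(**kwargs)


def _search(query: str, sort_by: str = "price") -> List[Dict[str, Any]]:
    # Deterministic behavior: substring match on the title, then sort.
    hits = [x for x in LISTINGS if query.lower() in x["title"].lower()]
    return sorted(hits, key=lambda x: x[sort_by])


search = Tool(name="search", required=["query"], run=_search)

# One call replaces the focus / type / click / sort / scan sequence.
cheapest = search(query="blue kayak")[0]
print(cheapest["title"], cheapest["price"])
```

The point of the sketch is only the shape of the interface: the agent decides which tool to call and with which arguments, while the steps inside the tool run without further LLM reasoning.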

Paper Structure

This paper contains 18 sections, 1 equation, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: WALT transforms browser agent automation from brittle step-by-step reasoning to efficient tool-based abstraction. Given the task "find the cheapest blue kayak," traditional web agents execute a lengthy sequence of primitive UI actions: focusing search boxes, hovering over dropdowns, clicking categories, then sorting and scanning results. In contrast, our method, WALT (Web Agents that Learn Tools), designs a deterministic tool that exposes this website-provided functionality to the agent: search(query='blue kayak', category='Boats', sort_by='price'), reducing execution from 8+ fragile UI steps to 1 robust operation.
  • Figure 2: Overview of WALT. Left—Discovery: the browser agent explores key site sections to propose tool candidates and record stabilized interaction traces (robust selectors with fallbacks). Right—Construction & validation: the tool constructor turns traces into an action script (navigation, extraction, interaction, agentic steps), promotes eligible UI chains to URL operations, induces a validated input schema, then registers and tests the tool end to end. Feedback refines selectors, schema, and script until a robust single-call tool is produced.
  • Figure 3: Results on VisualWebArena. Left: We report success rate (%) on each split as well as a weighted average. Right: We compare WALT's performance and efficiency with a baseline implementation as control.
  • Figure 4: Detailed analysis of the composition, success rates, and runtime invocations of tools discovered on the VisualWebArena Classifieds split.
  • Figure 5: Qualitative rollouts of WALT. Each row shows a task with tiled screenshots (left to right) and the agent's actions at each step (gray bars). Left, top: [PASS] "Recall exact item and return the most recent lister's email." The agent chains search_listings$\to$sort_results$\to$extract_content$\to$click, then surfaces the email from the item page. Right, top: [FAIL] "Find the most expensive boat with an image showing it on water; rate 5 stars and comment." The agent finds expensive listings and grounds the visual predicate but is unable to execute the rating action reliably. Left, bottom: [PASS] "Cheapest wall rack between $30--$40 that matches the animal shape in the image." URL-level search and sorting prune the space, and an extraction step picks the correct visual match before navigation. Right, bottom: [PASS] "Latest white Google Pixel; post a $10-under offer." The agent locates the newest listing and uses post_comment to complete the interaction. Across successes, trajectories are short (2--5 calls) and dominated by URL/navigation and schema-checked operations.
  • ...and 3 more figures
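
Two ideas from the Figure 1 and Figure 2 captions, selectors with fallbacks and promoting a UI chain to a URL operation behind a validated input schema, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the base URL, the query parameter names (`q`, `category`, `sort`), the allowed sort values, and both function names are hypothetical.

```python
from typing import Dict, List, Optional
from urllib.parse import urlencode

# Assumed set of sort keys the hypothetical site accepts.
ALLOWED_SORT = {"price", "date"}


def resolve_selector(dom: Dict[str, str], selectors: List[str]) -> str:
    """Return the first selector present in a toy DOM (a dict keyed by selector)."""
    for selector in selectors:
        if selector in dom:
            return selector
    raise LookupError(f"no selector matched: {selectors}")


def search_url(base: str, query: str,
               category: Optional[str] = None,
               sort_by: Optional[str] = None) -> str:
    """Build a search URL directly instead of replaying type/click/sort steps."""
    # Minimal schema check: non-empty query and a known sort key.
    if not query.strip():
        raise ValueError("query must be non-empty")
    if sort_by is not None and sort_by not in ALLOWED_SORT:
        raise ValueError(f"sort_by must be one of {sorted(ALLOWED_SORT)}")
    params = {"q": query}  # parameter names are assumptions
    if category:
        params["category"] = category
    if sort_by:
        params["sort"] = sort_by
    return f"{base}/search?{urlencode(params)}"


url = search_url("https://classifieds.example", "blue kayak",
                 category="Boats", sort_by="price")
print(url)
# -> https://classifieds.example/search?q=blue+kayak&category=Boats&sort=price
```

When a site encodes search state in its URL, a call like this collapses several UI steps into one navigation; the fallback list in `resolve_selector` covers the cases where an interaction still has to go through the page and the preferred selector has changed.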