WebGuard: Building a Generalizable Guardrail for Web Agents

Boyuan Zheng, Zeyi Liao, Scott Salisbury, Zeyuan Liu, Michael Lin, Qinyuan Zheng, Zifan Wang, Xiang Deng, Dawn Song, Huan Sun, Yu Su

TL;DR

WebGuard addresses the safety of autonomous web agents with a large-scale, action-level dataset that uses a three-tier risk schema (SAFE, LOW, HIGH) to predict the outcome of state-changing actions. Benchmarks show that frontier LLMs fall well short on this task, and supervised fine-tuning on WebGuard substantially improves accuracy and HIGH-risk recall, though reliability remains insufficient for high-stakes deployment. The work emphasizes generalization across domains and long-tail websites and proposes integrating guardrails with user-in-the-loop control. By releasing datasets, tools, and trained models, it aims to accelerate progress toward robust, generalizable web guardrails.

Abstract

The rapid development of autonomous web agents powered by Large Language Models (LLMs), while greatly elevating efficiency, exposes the frontier risk of taking unintended or harmful actions. This situation underscores an urgent need for effective safety measures, akin to access controls for human users. To address this critical challenge, we introduce WebGuard, the first comprehensive dataset designed to support the assessment of web agent action risks and facilitate the development of guardrails for real-world online environments. In doing so, WebGuard specifically focuses on predicting the outcome of state-changing actions and contains 4,939 human-annotated actions from 193 websites across 22 diverse domains, including often-overlooked long-tail websites. These actions are categorized using a novel three-tier risk schema: SAFE, LOW, and HIGH. The dataset includes designated training and test splits to support evaluation under diverse generalization settings. Our initial evaluations reveal a concerning deficiency: even frontier LLMs achieve less than 60% accuracy in predicting action outcomes and less than 60% recall in flagging HIGH-risk actions, highlighting the risks of deploying current-generation agents without dedicated safeguards. We therefore investigate fine-tuning specialized guardrail models using WebGuard. We conduct comprehensive evaluations across multiple generalization settings and find that a fine-tuned Qwen2.5VL-7B model yields a substantial improvement in performance, boosting accuracy from 37% to 80% and HIGH-risk action recall from 20% to 76%. Despite these improvements, the performance still falls short of the reliability required for high-stakes deployment, where guardrails must approach near-perfect accuracy and recall.
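The two headline metrics in the abstract, accuracy and HIGH-risk recall, can be illustrated with a minimal Python sketch. It assumes the standard definitions (accuracy over all three labels; recall as the fraction of actions annotated HIGH that are also predicted HIGH); this section does not spell out the paper's exact evaluation protocol. The `Risk` enum, the function names, and the toy labels are hypothetical and are not taken from the WebGuard release.

```python
from enum import Enum


class Risk(Enum):
    """Three-tier risk schema named in the abstract."""
    SAFE = "SAFE"
    LOW = "LOW"
    HIGH = "HIGH"


def accuracy(gold, pred):
    """Fraction of actions whose predicted label matches the annotation."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("no examples to score")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def recall(gold, pred, label=Risk.HIGH):
    """Of the actions annotated `label`, the fraction predicted as `label`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    relevant = [p for g, p in zip(gold, pred) if g == label]
    if not relevant:
        raise ValueError(f"no gold examples with label {label.value}")
    return sum(p == label for p in relevant) / len(relevant)


# Toy labels for illustration only (not WebGuard data).
gold = [Risk.SAFE, Risk.SAFE, Risk.LOW, Risk.HIGH, Risk.HIGH]
pred = [Risk.SAFE, Risk.LOW, Risk.LOW, Risk.HIGH, Risk.SAFE]

print(accuracy(gold, pred))  # 3 of 5 labels match
print(recall(gold, pred))    # 1 of 2 HIGH actions is caught
```

Reporting HIGH-risk recall separately matters because a model can reach respectable overall accuracy while still missing many of the actions a guardrail most needs to catch.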

Paper Structure

This paper contains 22 sections, 1 equation, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Data samples from WebGuard, where elements are labelled SAFE (green), LOW (orange), or HIGH (red) risk. Safe actions include those that have no long-term effect, like page navigation, while low-risk actions may have minor consequences. High-risk actions are often irreversible and can have a substantial effect on the user and website (e.g., scheduling a test drive).
  • Figure 2: Distribution of websites and domains in WebGuard.
  • Figure 3: Demonstration of using WebGuard guardrail with web agents.
  • Figure 4: Annotation Interface
  • Figure 5: Example of a Misclassified Intermediate Action. The proposed action involves clicking a checkbox to confirm the user’s acknowledgment of privacy and disclaimer terms during the U.S. passport application process. Although part of a high-stakes workflow, this specific action is a reversible and standard procedural step with no immediate legal, financial, or ethical consequences. The model incorrectly classifies this action as LOW risk, while the ground truth label is SAFE, highlighting the model’s tendency to overestimate the risk of intermediate steps within a broader state-changing operation.
  • ...and 1 more figure
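The guardrail usage shown in Figure 3, combined with the user-in-the-loop control mentioned in the TL;DR, might be sketched in Python as below. This is only an illustration under stated assumptions: the policy of asking the user at or above a configurable risk threshold, the `gate_action` function, and the stub predictor and confirmation callbacks are all hypothetical and do not represent the authors' implementation or any released API.

```python
from typing import Callable

SAFE, LOW, HIGH = "SAFE", "LOW", "HIGH"
_ORDER = {SAFE: 0, LOW: 1, HIGH: 2}


def gate_action(
    action: str,
    predict_risk: Callable[[str], str],
    confirm: Callable[[str, str], bool],
    ask_from: str = HIGH,
) -> bool:
    """Return True if the agent may execute `action`.

    Assumed policy: actions below `ask_from` proceed automatically;
    actions at or above it require explicit user confirmation.
    """
    risk = predict_risk(action)
    if risk not in _ORDER:
        # Fail closed: an unrecognized label is escalated to the user.
        return confirm(action, risk)
    if _ORDER[risk] >= _ORDER[ask_from]:
        return confirm(action, risk)
    return True


# Hypothetical stand-ins for a guardrail model and a user prompt.
def toy_predict(action: str) -> str:
    return HIGH if "submit" in action else SAFE


def always_deny(action: str, risk: str) -> bool:
    return False


print(gate_action("open product page", toy_predict, always_deny))
print(gate_action("submit payment", toy_predict, always_deny))
```

Lowering `ask_from` to `LOW` trades more user interruptions for fewer unreviewed state changes, which may be a sensible default while guardrail recall remains below what high-stakes deployment requires.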