WAREX: Web Agent Reliability Evaluation on Existing Benchmarks

Su Kara, Fazle Faisal, Suman Nath

TL;DR

WAREX addresses the gap between lab-based web-agent benchmarks and real-world operation by introducing a network-layer fault-injection proxy that can simulate common web failures and adversarial content without modifying existing benchmarks or agent code. By coupling fault injection with detailed efficiency logging, it enables robust assessment of both reliability and cost across multiple popular benchmarks (WebArena, REAL, WebVoyager). The study demonstrates substantial degradation in task success under faults, highlights model-dependent differences in recovery and behavior, and shows that prompting can partially mitigate some failures. This framework provides a practical, extensible approach to stress-testing web agents, with implications for safer deployment and targeted improvements in robustness and resilience.

Abstract

Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. However, in the real world, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client-side issues, server-side issues, or broader system failures. Moreover, live websites are prone to web attacks such as Cross-Site Scripting, as well as general site modifications, which can cause unexpected or malicious pop-ups or improper functionality. To address this gap, we present WAREX: Web Agent Reliability Evaluation on Existing Benchmarks. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents.

Paper Structure

This paper contains 18 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: WAREX Framework. A network proxy splits TLS between client and server and runs an injection script with web failure logic: Type (failure mode: network delay, 5xx, JS error, popup) and Frequency (targeting injection policy: exact/regex URL(s); k-th, every-k, or random n occurrences). The proxy rewrites selected responses and returns the modified page to the agent, which uses it to decide its next action; the server remains unchanged.
  • Figure 2: Default home page for Omnizon task type in REAL benchmark with no fault injected.
  • Figure 3: Unreliable scenarios created using the WAREX framework.
  • Figure 4: Main Experiment. Average (a) Success Rate, (b) Latency, (c) Number of LLM calls, (d) Cost per Task for each web failure type in the legend above on each benchmark. All 660 WebArena, 112 REAL, and 643 WebVoyager tasks are considered, and we use GPT-4o as the backbone.
  • Figure 5: Efficiency metric comparison between the Failure Scenarios and Improved or "Fixed" versions with prompting for WebArena (contrast with scores in Figure 4).
  • ...and 2 more figures
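
The Type and Frequency logic described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the names `InjectionPolicy`, `apply_fault`, and `insert_before_body_close`, the 5-second delay, the 503 status, and the popup markup are all assumptions chosen for illustration. The sketch also leaves out the TLS-splitting proxy itself and only models the decision of when to inject and how a response would be rewritten.

```python
import random
import re
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class InjectionPolicy:
    """Hypothetical Frequency policy: which matching responses get a fault."""
    url_pattern: str            # regex matched against the request URL
    mode: str = "every_k"       # "kth", "every_k", or "random_n"
    k: int = 1                  # used by "kth" and "every_k"
    n: int = 1                  # how many occurrences "random_n" selects
    window: int = 10            # "random_n" samples from the first `window` matches
    seed: Optional[int] = None  # fixed seed makes "random_n" reproducible
    _seen: int = field(default=0, init=False)
    _chosen: set = field(default_factory=set, init=False)

    def __post_init__(self):
        self._regex = re.compile(self.url_pattern)
        if self.mode == "random_n":
            rng = random.Random(self.seed)
            self._chosen = set(rng.sample(range(1, self.window + 1), self.n))

    def should_inject(self, url: str) -> bool:
        """Count matching URLs and decide whether this occurrence is targeted."""
        if not self._regex.search(url):
            return False
        self._seen += 1
        if self.mode == "kth":
            return self._seen == self.k
        if self.mode == "every_k":
            return self._seen % self.k == 0
        if self.mode == "random_n":
            return self._seen in self._chosen
        raise ValueError(f"unknown mode: {self.mode}")


def insert_before_body_close(html: str, snippet: str) -> str:
    """Place snippet just before the last </body>, or append if there is none."""
    idx = html.lower().rfind("</body>")
    if idx == -1:
        return html + snippet
    return html[:idx] + snippet + html[idx:]


def apply_fault(fault_type: str, status: int, body: str) -> Tuple[float, int, str]:
    """Hypothetical Type logic: return (delay_seconds, status, body) after the fault."""
    if fault_type == "delay":
        return 5.0, status, body  # assumed delay; the caller would sleep before replying
    if fault_type == "server_error":
        return 0.0, 503, "<html><body><h1>503 Service Unavailable</h1></body></html>"
    if fault_type == "js_error":
        script = "<script>throw new Error('injected fault');</script>"
        return 0.0, status, insert_before_body_close(body, script)
    if fault_type == "popup":
        popup = ('<div id="injected-popup" style="position:fixed;top:20%;left:30%;'
                 'z-index:9999;background:#fff;padding:1em">Unexpected pop-up</div>')
        return 0.0, status, insert_before_body_close(body, popup)
    raise ValueError(f"unknown fault type: {fault_type}")


if __name__ == "__main__":
    policy = InjectionPolicy(url_pattern=r"/cart", mode="every_k", k=2)
    for url in ["/cart", "/home", "/cart"]:
        if policy.should_inject(url):
            print(url, apply_fault("server_error", 200, "<html><body>ok</body></html>")[1])
```

In a real deployment, a response hook of a TLS-terminating proxy would call `should_inject` on each request URL and, when it returns true, replace the outgoing response with the result of `apply_fault`. Because only the response in transit changes, the benchmark server and the agent code stay untouched, which matches the design described in the caption.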