WAREX: Web Agent Reliability Evaluation on Existing Benchmarks
Su Kara, Fazle Faisal, Suman Nath
TL;DR
WAREX addresses the gap between lab-based web-agent benchmarks and real-world operation by introducing a network-layer fault-injection proxy that can simulate common web failures and adversarial content without modifying existing benchmarks or agent code. By coupling fault injection with detailed efficiency logging, it enables robust assessment of both reliability and cost across multiple popular benchmarks (WebArena, REAL, WebVoyager). The study demonstrates substantial degradation in task success under faults, highlights model-dependent differences in recovery and behavior, and shows that prompting can partially mitigate some failures. This framework provides a practical, extensible approach to stress-testing web agents, with implications for safer deployment and targeted improvements in robustness and resilience.
Abstract
Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as containers or stable networks, where websites behave deterministically. In the real world, however, users access websites over networks and HTTPS connections that introduce instability from multiple sources: client-side issues, server-side issues, or broader system failures. Moreover, live websites are prone to web attacks such as Cross-Site Scripting, as well as general site modifications that can cause unexpected or malicious pop-ups or improper functionality. To address this gap, we present WAREX: Web Agent Reliability Evaluation on Existing Benchmarks. We measure the impact of WAREX across three popular benchmarks: WebArena, WebVoyager, and REAL. Our experiments show that introducing WAREX leads to significant drops in task success rates, highlighting the limited robustness of state-of-the-art agents.
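As a rough illustration of the two failure classes named above (server-side errors and unexpected pop-ups), the sketch below shows how a response might be altered between the website and the agent. It is a minimal sketch under stated assumptions, not the WAREX implementation: the names `Fault`, `Response`, `maybe_inject`, and `inject_popup` are hypothetical, and a real proxy would apply comparable logic to live HTTP traffic rather than to in-memory objects.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """A simplified HTTP response (hypothetical type for illustration)."""
    status: int
    body: str


@dataclass(frozen=True)
class Fault:
    """A hypothetical fault rule; field names are assumptions, not from WAREX."""
    name: str
    status: int          # HTTP status returned when the fault fires
    probability: float   # chance in [0, 1] that the fault fires per request
    body: str = ""


def maybe_inject(upstream: Response, fault: Fault, rng: random.Random) -> Response:
    """Return an injected error with the configured probability;
    otherwise pass the upstream response through unchanged."""
    if rng.random() < fault.probability:
        return Response(fault.status, fault.body or f"injected fault: {fault.name}")
    return upstream


def inject_popup(html: str, snippet: str) -> str:
    """Insert a snippet just before the last </body> tag to mimic an
    unexpected pop-up; append it if the page has no </body> tag."""
    idx = html.lower().rfind("</body>")
    if idx == -1:
        return html + snippet
    return html[:idx] + snippet + html[idx:]
```

Because the alteration happens on the response rather than inside the benchmark or the agent, the same rule could in principle be reused across benchmarks; seeding the random generator keeps a faulted run repeatable.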
