WebGames: Challenging General-Purpose Web-Browsing AI Agents

George Thomas; Alex J. Chan; Jikun Kang; Wenqi Wu; Filippos Christianos; Fraser Greenlee; Andy Toulis; Marvin Purtorab

TL;DR

WebGames is a hermetic, client-side benchmark of 50+ interactive web challenges that measures general-purpose web-browsing agents across fundamental interactions, advanced input, cognition, workflow automation, and entertainment. Using Set-of-Marks scaffolding and ReAct prompting within a POMDP framework, it pairs deterministic verification with ground-truth solutions to compare leading vision-language models against human baselines. The comparison reveals a large performance gap: the best model reaches 41.2%, versus 95.7% for humans. The results highlight core limitations of current AI systems in web interaction and show that specialized proxy architectures can close some of the gap, illustrating both the potential of and the need for targeted improvements. A lightweight, client-side implementation and a modular challenge design facilitate rapid evaluation and progress tracking. Future directions include harder challenges, multi-agent coordination, and richer metrics.

Abstract

We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models, including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL, against human performance. Results reveal a substantial capability gap, with the best AI system achieving only a 43.1% success rate compared to human performance of 95.7%, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at webgames.convergence.ai, offering a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in the development of more capable web-browsing agents.
