
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, Sumeet Motwani

TL;DR

REAL provides a deterministic, hosted suite of 11 realistic website replicas and 112 evaluation tasks to rigorously test autonomous web agents. It combines programmatic state checks for action tasks with rubric-based LLM judgments for information tasks, and its flexible harness supports both open-source and proprietary agents. The framework enables reproducible experimentation and post-training data generation, and it exposes substantial gaps in current agent capabilities: frontier models achieve at most a 41% success rate. By offering configurable environments, a public leaderboard, and easy integration, REAL aims to accelerate the development of reliable, real-world web agents and to facilitate broader RL and planning research in web navigation.

Abstract

We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.
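The two evaluation modes described in the abstract can be illustrated with a minimal Python sketch. This is not the REAL harness: the `Task` fields, `state_check`, `keyword_judge`, and `evaluate` are hypothetical names introduced here, and the keyword matcher is only a stand-in for a rubric-guided LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    """Hypothetical task record; the field names are assumptions, not REAL's schema."""
    goal: str
    expected_state: Optional[dict] = None  # set for action-based tasks
    rubric: Optional[str] = None           # set for information-retrieval tasks


def state_check(final_state: dict, expected: dict) -> bool:
    """Programmatic check: every expected key must hold the expected value."""
    return all(final_state.get(key) == value for key, value in expected.items())


def keyword_judge(rubric: str, answer: str) -> bool:
    """Stand-in for an LLM judge: passes if the rubric text appears in the answer."""
    return rubric.lower() in answer.lower()


def evaluate(task: Task, final_state: dict, answer: str,
             judge: Callable[[str, str], bool]) -> float:
    """Return 1.0 only if every check that applies to the task passes."""
    checks = []
    if task.expected_state is not None:
        checks.append(state_check(final_state, task.expected_state))
    if task.rubric is not None:
        checks.append(judge(task.rubric, answer))
    return 1.0 if checks and all(checks) else 0.0
```

A task that sets both `expected_state` and `rubric` is scored as a success only when both checks pass, which is one plausible reading of the "and/or" combination; the actual scoring rule may differ.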

Paper Structure

This paper contains 27 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: The REAL benchmark and framework. REAL provides 11 realistic, deterministic, high-fidelity web environments (across e-commerce, networking, communication, scheduling, booking, project management) and 110+ evaluation tasks. An agent interacting with the environments receives an observation ($o_t$) and executes an action ($a_t$) at each step to complete a task. Upon completion, an outcome reward ($r_T$) is evaluated via programmatic state verification and/or a rubric-based LLM judge.
  • Figure 2: Screenshots of representative web environments included in REAL (8 of 11 shown). These are high-fidelity, deterministic replicas of popular websites, hosted by us for easy accessibility. These environments feature complex, multi-page workflows with persistent state management on the browser, allowing detailed tracking and inspection of state changes induced by agent actions.
  • Figure 3: Performance of evaluated models on the REAL benchmark, measured by end-to-end task success rate of our baseline agent across 112 tasks. Claude 3.7 Sonnet-Thinking achieves 41.07%.
  • Figure 4: A per-website performance breakdown for several frontier models across REAL environments. TopWork and FlyUnified are consistently the most challenging environments.
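The interaction loop in the Figure 1 caption (observation $o_t$, action $a_t$, outcome reward $r_T$ at the end of the episode) can be sketched in Python as follows. Everything here is a toy under stated assumptions: `ToyCartEnv`, `run_episode`, `scripted_policy`, `cart_reward`, and `success_rate` are hypothetical names and do not come from the REAL codebase.

```python
from typing import Callable, List, Tuple


class ToyCartEnv:
    """Hypothetical deterministic environment: a shop page with a cart."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self.cart: List[str] = []
        self.t = 0

    def _obs(self) -> dict:
        # Observation o_t: a copy of the state the agent is allowed to see.
        return {"cart": list(self.cart), "step": self.t}

    def reset(self) -> dict:
        # Deterministic reset: every episode starts from the same state.
        self.cart = []
        self.t = 0
        return self._obs()

    def step(self, action: str) -> Tuple[dict, bool]:
        # Action a_t: "add:<item>" mutates state, "stop" ends the episode.
        self.t += 1
        if action.startswith("add:"):
            self.cart.append(action[len("add:"):])
        done = action == "stop" or self.t >= self.max_steps
        return self._obs(), done


def run_episode(env: ToyCartEnv, policy: Callable[[dict], str],
                reward_fn: Callable[[dict], float]) -> float:
    """Roll out one episode and return the outcome reward r_T."""
    obs = env.reset()
    done = False
    while not done:
        obs, done = env.step(policy(obs))
    return reward_fn(obs)  # reward is computed only from the final state


def scripted_policy(obs: dict) -> str:
    """Stand-in for an agent: add one book, then stop."""
    return "stop" if "book" in obs["cart"] else "add:book"


def cart_reward(final_obs: dict) -> float:
    """Programmatic state verification of the final cart contents."""
    return 1.0 if final_obs["cart"] == ["book"] else 0.0


def success_rate(rewards: List[float]) -> float:
    """End-to-end success rate in percent over a list of 0/1 outcomes."""
    return 100.0 * sum(rewards) / len(rewards)
```

Because the toy environment resets to the same state every time, repeated runs of the same policy give the same reward, which is the property that makes evaluation reproducible. As a purely arithmetic note, 46 successes out of 112 tasks would give 41.07%, matching the figure in the Figure 3 caption, although the raw count is not stated in the source.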