DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

Peter Jansen, Marc-Alexandre Côté, Tushar Khot, Erin Bransom, Bhavana Dalvi Mishra, Bodhisattwa Prasad Majumder, Oyvind Tafjord, Peter Clark

TL;DR

DISCOVERYWORLD introduces a text-based virtual environment to benchmark end-to-end scientific discovery across eight domains, enabling agents to ideate, design experiments, collect data, and draw explanatory conclusions. The platform generates 120 task instances via parametric templates across 8 themes and 3 difficulties, with unit tests to separate discovery-specific reasoning from routine tasks, and a Gym-like API for interaction. Evaluation relies on three automatic metrics—task completion, task-relevant actions, and explanatory knowledge accuracy—complemented by GPT-4o-based automatic knowledge scoring and human baselines. Findings show humans solve a majority of tasks and demonstrate robust discovery knowledge, while strong baselines (ReAct, Plan+Execute, Hypothesizer) struggle with end-to-end discovery, highlighting substantial room for advancing general AI discovery capabilities. DISCOVERYWORLD aims to accelerate progress by providing a broad, open benchmark with automatic evaluation tools and a commitment to community involvement through open-source release.

Abstract

Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent's capacity for end-to-end scientific reasoning is challenging, as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery. DISCOVERYWORLD contains a variety of challenges, covering topics as diverse as radioisotope dating, rocket science, and proteomics, to encourage development of general discovery skills rather than task-specific solutions. DISCOVERYWORLD itself is an inexpensive, simulated, text-based environment (with an optional 2D visual overlay). It includes 120 different challenge tasks, spanning eight topics, each with three levels of difficulty and several parametric variations. Each task requires an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. DISCOVERYWORLD further provides three automatic metrics for evaluating performance, based on (a) task completion, (b) task-relevant actions taken, and (c) the discovered explanatory knowledge. We find that strong baseline agents that perform well in previously published environments struggle on most DISCOVERYWORLD tasks. This suggests that DISCOVERYWORLD captures some of the novel challenges of discovery, and thus that it may help accelerate near-term development and assessment of scientific discovery competency in agents. Code available at: www.github.com/allenai/discoveryworld
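
The task count in the abstract can be checked with simple arithmetic: 120 tasks over eight topics and three difficulty levels imply five parametric variations per topic and difficulty pair. The sketch below only works through that count. The index-based task identifiers are hypothetical placeholders for illustration and are not the benchmark's own task names or API.

```python
from itertools import product

# Counts stated in the abstract.
NUM_TOPICS = 8
NUM_DIFFICULTIES = 3
TOTAL_TASKS = 120

# Implied number of parametric variations per (topic, difficulty) pair.
variations_per_cell, remainder = divmod(TOTAL_TASKS, NUM_TOPICS * NUM_DIFFICULTIES)

# Hypothetical index-based identifiers (topic, difficulty, variation);
# these are placeholders, not names used by DISCOVERYWORLD itself.
task_grid = list(
    product(range(NUM_TOPICS), range(NUM_DIFFICULTIES), range(variations_per_cell))
)

print(variations_per_cell, remainder, len(task_grid))
```

A remainder of zero confirms that the stated counts divide evenly, so every topic and difficulty pair can hold the same number of variations.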

Paper Structure

This paper contains 58 sections, 4 figures, 10 tables.

Figures (4)

  • Figure 1: DiscoveryWorld is a virtual environment for developing and evaluating discovery agents, with challenge tasks covering a broad variety of different topics such as those shown above.
  • Figure 2: DiscoveryWorld tasks require end-to-end scientific discovery, from ideation and hypothesis formation, through experiment design, data collection, and analysis, to forming conclusions and acting on results. Distractors and task solutions that provide only descriptive discoveries require agents to frequently iterate hypotheses and experiments to reach full explanatory discoveries.
  • Figure 3: Example instances of the 10 Unit Test themes.
  • Figure 4: Example of the user interface the participants used. This is for the It's (not) Rocket Science! theme.