BoxingGym: Benchmarking Progress in Automated Experimental Design and Model Discovery

Kanishk Gandhi, Michael Y. Li, Lyle Goodyear, Agam Bhatia, Louise Li, Aditi Bhaskar, Mohammed Zaman, Noah D. Goodman

TL;DR

BoxingGym presents a holistic benchmark that simultaneously evaluates experimental design and model discovery for language-based scientific agents. The framework embeds ten real-world-inspired probabilistic environments, enabling active experimentation and theory revision, with evaluation through expected information gain, explanation-based communication, and goal-driven prediction. Across model scales and architectures, larger closed-source LLMs show stronger performance, yet substantial variability remains across domains, and augmenting with explicit statistical modeling yields inconsistent gains. The work highlights both the promise and the current limitations of automated scientific reasoning, and suggests directions for improving autonomous exploration, explanation quality, and domain diversity in future research.

Abstract

Understanding the world and explaining it with scientific theories is a central aspiration of artificial intelligence research. Proposing theories, designing experiments to test them, and then revising them based on data are fundamental to scientific discovery. Despite the significant promise of LLM-based scientific agents, no benchmarks systematically test LLMs' ability to propose scientific models, collect experimental data, and revise those models in light of new data. We introduce BoxingGym, a benchmark with 10 environments for systematically evaluating both experimental design (e.g., collecting data to test a scientific theory) and model discovery (e.g., proposing and revising scientific theories). To enable tractable and quantitative evaluation, we implement each environment as a generative probabilistic model with which a scientific agent can run interactive experiments. These probabilistic models are drawn from various real-world scientific domains ranging from psychology to ecology. To quantitatively evaluate a scientific agent's ability to collect informative experimental data, we compute the expected information gain (EIG), an information-theoretic quantity which measures how much an experiment reduces uncertainty about the parameters of a generative model. A good scientific theory is a concise and predictive explanation. Therefore, to quantitatively evaluate model discovery, we ask a scientific agent to explain its model and then assess whether this explanation enables another scientific agent to make reliable predictions about this environment. In addition to this explanation-based evaluation, we compute standard model evaluation metrics such as prediction errors. We find that current LLMs, such as GPT-4o, struggle with both experimental design and model discovery, and that augmenting the LLM-based agent with an explicit statistical model does not reliably improve these results.
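
To make the EIG criterion concrete, the sketch below estimates it with a standard nested Monte Carlo estimator, using the usual definition EIG(d) = E_{p(y|d)}[H[p(θ)] - H[p(θ | y, d)]]. The abstract does not say which estimator the benchmark uses, so this is an illustration and not the authors' implementation. The toy model (a scalar parameter θ ~ N(0, 1) observed as y = θ·d + Gaussian noise) and the function names `eig_nested_mc` and `eig_exact` are assumptions chosen because this model has a closed-form EIG to compare against.

```python
import numpy as np


def gaussian_logpdf(y, mean, sigma):
    """Log density of N(mean, sigma^2) evaluated at y."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mean) / sigma) ** 2


def eig_nested_mc(design, sigma=1.0, n_outer=2000, n_inner=1000, seed=0):
    """Nested Monte Carlo EIG estimate for the toy model y = theta * design + noise.

    Assumed toy model (not from the paper): theta ~ N(0, 1), noise ~ N(0, sigma^2).
    """
    rng = np.random.default_rng(seed)

    # Outer loop: simulate (theta, y) pairs from the joint distribution.
    theta_outer = rng.standard_normal(n_outer)
    y = theta_outer * design + sigma * rng.standard_normal(n_outer)
    log_lik = gaussian_logpdf(y, theta_outer * design, sigma)

    # Inner loop: approximate log p(y | design) by averaging the likelihood
    # over fresh prior samples; the array has shape (n_outer, n_inner).
    theta_inner = rng.standard_normal(n_inner)
    inner = gaussian_logpdf(y[:, None], theta_inner[None, :] * design, sigma)
    shift = inner.max(axis=1, keepdims=True)  # log-sum-exp stabilisation
    log_marginal = shift[:, 0] + np.log(np.mean(np.exp(inner - shift), axis=1))

    # EIG is the expected log ratio of likelihood to marginal likelihood.
    return float(np.mean(log_lik - log_marginal))


def eig_exact(design, sigma=1.0):
    """Closed-form EIG for the same linear-Gaussian toy model."""
    return 0.5 * np.log1p(design**2 / sigma**2)


if __name__ == "__main__":
    for d in (0.5, 1.0, 2.0):
        print(d, eig_nested_mc(d), eig_exact(d))
```

In this toy model a larger design value `d` amplifies the signal relative to the noise, so the experiment is more informative and the estimated EIG grows with `d`. A design of zero carries no information about θ and yields an EIG of zero.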

Paper Structure (65 sections, 20 equations, 12 figures, 12 tables)

Figures (12)

  • Figure 1: Overview of BoxingGym. The BoxingGym Framework is designed to holistically evaluate experimental design and model discovery capabilities in the spirit of George Box (Box, 1976). 1) The process starts with a user defining a goal for the scientist agent. 2) The scientist formulates a theory. 3) This theory guides the experimental design, where the scientist interacts with a simulated world to gather new data. 4) The scientist then analyzes the new and old data to propose and refine theories. This iterative process continues for several iterations. 5) The scientist is then asked to explain the findings to a novice. 6) We evaluate the novice and the scientist by casting the goal as a prediction problem.
  • Figure 2: Python pseudocode examples. (left) BoxingGym is instantiated as modular classes and methods for the environment (WorldEnv), goals (Goal), and agents (Agent). (center) Pseudocode illustrating the workflow of setting goals, performing experiments, predicting outcomes, and providing explanations. (right) An example, hyperbolic temporal discounting, where the agent predicts a participant's choice between immediate and delayed rewards and explains the concept to a novice.
  • Figure 3: Normalized Error Compared across Models. (a) Comparison of the normalized errors for different LLMs with or without prior information included in the prompt. (b) Comparison of reasoning models (OpenThinker) and instruct models (Qwen) across environments. Error bars are the standard error across 5 runs.
  • Figure 4: Normalized Errors Over Number of Observations. Normalized errors for the GPT-4o LLM agent with prior information (solid blue) and without prior information (dotted yellow) across three domains: Population Growth Dynamics (left), IRT (center) and Hyperbolic Discounting (right). Error bars are the standard error across 5 runs.
  • Figure 5: (a) Comparison of Box's Apprentice with an LLM agent. (b) EIG Regret scores for six large language models, with lower values indicating better performance.
  • ...and 7 more figures
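
The workflow described in the captions for Figures 1 and 2 (set a goal, run experiments, revise the theory, explain, predict) can be sketched as follows. Only the class names `WorldEnv`, `Goal`, and `Agent` come from the Figure 2 caption. Every method name, the logistic choice rule, the parameter values, and the grid-search agent standing in for an LLM scientist are assumptions made for illustration, not the BoxingGym API.

```python
import math
import random


class WorldEnv:
    """Toy hyperbolic temporal discounting environment (hypothetical interface)."""

    def __init__(self, k=0.05, seed=0):
        self.k = k  # hidden discount rate the scientist tries to learn
        self.rng = random.Random(seed)

    def prob_delayed(self, immediate, delayed, days):
        # Hyperbolic discounting: a delayed reward is worth delayed / (1 + k * days).
        value = delayed / (1.0 + self.k * days)
        return 1.0 / (1.0 + math.exp(-(value - immediate)))

    def run_experiment(self, design):
        # Returns 1 if the simulated participant picks the delayed reward.
        return int(self.rng.random() < self.prob_delayed(*design))


class Goal:
    """Casts the goal as predicting the participant's choice on held-out designs."""

    def __init__(self, env, n_eval=200, seed=1):
        rng = random.Random(seed)
        self.env = env
        self.eval_designs = [
            (rng.uniform(1, 20), 20.0, rng.uniform(1, 60)) for _ in range(n_eval)
        ]

    def evaluate(self, predict):
        # Fraction of held-out designs where the prediction misses the likelier choice.
        misses = 0
        for design in self.eval_designs:
            truth = int(self.env.prob_delayed(*design) > 0.5)
            misses += int(predict(design) != truth)
        return misses / len(self.eval_designs)


class Agent:
    """Stand-in for an LLM scientist: random designs plus a grid-search fit of k."""

    def __init__(self, seed=2):
        self.rng = random.Random(seed)
        self.data = []
        self.k_hat = 0.1  # initial theory

    def design_experiment(self):
        return (self.rng.uniform(1, 20), 20.0, self.rng.uniform(1, 60))

    def _log_likelihood(self, k):
        total = 0.0
        for (immediate, delayed, days), outcome in self.data:
            value = delayed / (1.0 + k * days)
            p = 1.0 / (1.0 + math.exp(-(value - immediate)))
            p = min(max(p, 1e-9), 1 - 1e-9)  # guard against log(0)
            total += math.log(p if outcome else 1 - p)
        return total

    def update(self, design, outcome):
        # Revise the theory: pick the k on a coarse grid that best fits all data.
        self.data.append((design, outcome))
        grid = [i / 200 for i in range(1, 101)]
        self.k_hat = max(grid, key=self._log_likelihood)

    def predict(self, design):
        immediate, delayed, days = design
        return int(delayed / (1.0 + self.k_hat * days) > immediate)

    def explain(self):
        return (
            "The participant values a reward R delayed by D days at "
            f"R / (1 + {self.k_hat:.3f} * D) and picks the larger value."
        )


def run_episode(n_experiments=30):
    env = WorldEnv()
    goal = Goal(env)
    agent = Agent()
    for _ in range(n_experiments):
        design = agent.design_experiment()
        outcome = env.run_experiment(design)
        agent.update(design, outcome)
    return goal.evaluate(agent.predict), agent.explain()


if __name__ == "__main__":
    error, explanation = run_episode()
    print(f"prediction error: {error:.2f}")
    print(explanation)
```

This sketch scores only the scientist. In the benchmark as described in the Figure 1 caption, the explanation is also handed to a novice agent, which is then evaluated on the same prediction problem.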