NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See

TL;DR

NewtonBench presents the first scalable, scientifically authentic benchmark for generalizable scientific law discovery by coupling counterfactual law shifts with an interactive model-discovery environment. It shows that frontier LLMs exhibit a clear but fragile capability to rediscover laws, which deteriorates with system complexity and observation noise, and reveals a paradox where tool assistance can hinder stronger models via premature exploitation. The framework uses two metrics, Symbolic Accuracy and RMSLE, and a rigorous LLM-as-a-judge to assess equivalence, offering a principled platform to stress-test and guide next-generation AI agents toward genuine scientific discovery. By providing 324 tasks across 12 physics domains and a code-execution option, NewtonBench enables precise measurement of reasoning, exploration, and generalization in interactive scientific inquiry.
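As a rough illustration of the numeric metric named above, the sketch below computes a root mean squared logarithmic error in its common `log(1 + y)` form. This summary does not give the paper's exact formula, so that form, the function name `rmsle`, and the non-negativity check are all assumptions made for this example.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, common log(1 + y) form.

    Assumption: this is the textbook definition; the benchmark's own
    formula may differ (for example in how negative values are handled).
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t < 0 or p < 0:
            raise ValueError("this form of RMSLE requires non-negative values")
        # Squared difference of the log-transformed values.
        total += (math.log1p(p) - math.log1p(t)) ** 2
    return math.sqrt(total / len(y_true))
```

Because the error is measured on a log scale, a prediction that is off by a constant factor is penalized similarly whether the true value is large or small, which suits physical quantities that span many orders of magnitude.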

Abstract

Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using counterfactual law shifts (systematic alterations of canonical laws) to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiments reveal a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.
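To make the idea of a counterfactual law shift concrete, here is a minimal sketch. It assumes a hypothetical shift that changes the distance exponent of an inverse-square law from 2 to 1.5, and a toy "experiment" in which the agent queries the law at chosen inputs. The function names, the chosen shift, and the log-log fitting procedure are illustrative assumptions, not the benchmark's actual tasks or interface.

```python
import math


def canonical_law(m1, m2, r, g=1.0):
    """Canonical inverse-square law, F = g * m1 * m2 / r**2 (arbitrary units)."""
    return g * m1 * m2 / r ** 2


def shifted_law(m1, m2, r, g=1.0):
    """Hypothetical counterfactual shift: the exponent on r is 1.5, not 2."""
    return g * m1 * m2 / r ** 1.5


def estimate_distance_exponent(law, r_values):
    """Probe `law` at several distances and fit the slope of log F vs log r.

    With the masses held fixed, a power law F ~ r**(-k) is a straight line
    of slope -k in log-log space, so ordinary least squares recovers k.
    """
    xs = [math.log(r) for r in r_values]
    ys = [math.log(law(1.0, 1.0, r)) for r in r_values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    slope_num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope_den = sum((x - x_mean) ** 2 for x in xs)
    return -slope_num / slope_den


if __name__ == "__main__":
    probes = [1.0, 2.0, 4.0, 8.0]
    print(estimate_distance_exponent(canonical_law, probes))
    print(estimate_distance_exponent(shifted_law, probes))
```

An agent that simply recalls the textbook exponent would fail on the shifted law, while one that actually runs the probes recovers the altered exponent. This is the sense in which such shifts resist memorization.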

Paper Structure

This paper contains 74 sections, 4 theorems, 5 equations, 21 figures, 21 tables, 1 algorithm.

Key Result

Lemma E.1

Under assumptions A1--A2, for any chosen input ${\bm{u}} \in U$ there exists a computable experiment input ${\bm{x}}({\bm{u}}) \in D$ such that, from the observed outputs $\mathcal{Y}_{\mathcal{M}}({\bm{x}}({\bm{u}}))$, the agent can compute a direct observation of $f_{\text{target}}({\bm{u}})$.
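The lemma's claim can be illustrated with a toy model. Assumptions A1 and A2 are not reproduced in this summary, so the sketch below simply assumes that the hidden target law feeds into a known, invertible downstream equation and that the experiment input can be chosen equal to the target input. Every name here (`f_target`, `known_downstream`, `model_outputs`, `observe_target`) is hypothetical, and this is not the paper's construction or proof.

```python
def f_target(u):
    """Hidden target law; in a real task the agent cannot read this."""
    a, b = u
    return a * b ** 2


def known_downstream(z):
    """Known, invertible equation on the path from the target to the output."""
    return 3.0 * z + 1.0


def invert_downstream(y):
    """Exact inverse of `known_downstream`."""
    return (y - 1.0) / 3.0


def model_outputs(x):
    """Simulated model system: the agent only sees this composite output."""
    return known_downstream(f_target(x))


def observe_target(u):
    """Compute a direct observation of f_target(u) from model outputs.

    Here the experiment input x(u) is simply u; the known part of the
    path is then inverted to isolate the target's value.
    """
    x = u  # assumed trivial mapping u -> x(u)
    y = model_outputs(x)
    return invert_downstream(y)
```

In this toy setting, calling `observe_target` at many inputs `u` reduces the interactive task to ordinary function identification on pairs `(u, f_target(u))`.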

Figures (21)

  • Figure 1: An illustration of core designs in NewtonBench: experimentation with model system, counterfactual shifts in physical laws from various domains, and agentic exploration settings.
  • Figure 2: Impact of noise levels on performance.
  • Figure 3: Result across physics domains.
  • Figure 4: Inference cost among difficulty levels.
  • Figure 5: Results under different code-use budgets.
  • ...and 16 more figures

Theorems & Definitions (11)

  • Definition 1: Equation
  • Definition 2: Model
  • Definition 3: Task Formalization
  • Definition 4: Solvability
  • Definition 5: Evaluation Map and Separating Set
  • Lemma E.1: Path Inversion and Target Isolation
  • proof
  • Corollary E.2: Equivalence to Function Identification
  • Theorem E.3: Finite-Sample Identifiability of the Target Law
  • proof
  • ...and 1 more