Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents

Nolan Koblischke, Hyunseok Jang, Kristen Menou, Mohamad Ali-Dib

TL;DR

Gravity-Bench-v1 presents a gravity-driven, partially observable benchmark to evaluate AI agents' capacity for scientific discovery within dynamic environments. Agents must strategically plan data collection under a fixed observation budget and reason over accumulating observations to infer hidden quantities and orbital properties, including out-of-distribution dynamics such as drag and modified gravity. The benchmark spans 16 two-body simulations, 50 tasks, and 206 task-simulation pairs, with expert baselines and an open-ended solution space to encourage novel planning and reasoning strategies. Findings show that while full-data performance is achievable for some models, constrained observation planning remains a major bottleneck, highlighting the need for improved long-horizon reasoning and adaptive experimentation, with potential extensions to richer physics and reinforcement-learning-enabled agents.

Abstract

Modern science emerged from reasoning over repeatedly-observed planetary motions. We present Gravity-Bench-v1, an environment-based benchmark that challenges AI agents on tasks that parallel this historical development. Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. Gravity-Bench includes out-of-distribution cases, i.e. with physics that deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan to collect data within an experimental budget and must perform a dynamic form of data analysis and reasoning to solve tasks efficiently. Our benchmark admits an open-ended space of solutions. Reference solutions for each task are provided to calibrate AI performance against human expertise. Technically at an upper-undergraduate level, our benchmark proves challenging to baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities.

Paper Structure

This paper contains 20 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of Gravity-Bench-v1 architecture and workflow. The binary star simulation environment (left, green) generates orbital trajectories based on input parameters, including out-of-distribution physics like modified gravity laws. An agent (right, purple) must solve physics discovery tasks by strategically collecting observations through the observe tool (limited to a budget of 100 observations). We evaluate against expert solutions that have access to the full simulation data, and we also run those expert solutions on 100 uniformly sampled observations as a baseline without planning. This design tests both scientific reasoning and intelligent observation planning capabilities.
  • Figure 2: Overview of the gravitational simulations used in the benchmark. Each panel shows the orbital trajectories of a binary star system in the x-y plane, with masses indicated in solar masses ($M_\odot$). The color gradient indicates simulation progress from start (dark) to end (light). Simulations include standard orbits, systems that are unbound, systems with modified gravity, systems with drag forces, systems with proper motion, etc. The benchmark also includes versions of the same system represented in different units, to evaluate unit handling. Sample questions that could be asked about these systems are shown.
  • Figure 3: Agent performance in finding the maximum velocity of a star under various observational budgets. (a) Percent error for each agent as a function of the total number of observations used, where each point represents an individual run. Uniformly sampling in time with an expert solution (red line) serves as a planning-free baseline. Claude 3.5 Sonnet (blue) sometimes refines its observations enough to achieve under 1% error, while GPT-4o (orange) shows less consistent improvement. (b) Observations attempted by each agent as a function of the maximum allocated observation budget. Points show individual runs, while lines with error bars show the mean and standard error across runs for that budget. While an ideal approach would exploit all available observations (dashed line), both GPT-4o and Claude 3.5 Sonnet stop early, often using fewer than half of the available observations for budgets above 30. This underutilization highlights a lack of robust planning and answer verification. (c), (d) Percent error in finding the periastron distance in a single, highly elliptical orbit where the stars spend only 0.2% of the time within 5% of the closest approach. As discussed in the appendix case study on planning, an expertly planned solution can achieve 2% error with 50 observations, but our uniform-sampling baseline with 100 observations (without planning) performs poorly (70% error), as do both AI agents. The horizontal dashed lines indicate the threshold by which the agents are marked in budget-obs-100.
  • Figure 4: Two observation-planning runs by Claude 3.5 Sonnet on the same task using 40 observations. The figure highlights how minor differences in planning lead to drastically different outcomes. In each run, the agent collects position data in multiple steps, computes velocity from finite differences, and refines its search for the peak velocity. Top panels: Excerpts of the agent's traces, including the planning, observations, and code use. On the left, the agent systematically tracks the highest-velocity times and progressively refines its estimate, achieving a final error of only 2%. On the right, however, the agent never accurately records peak-velocity times and proceeds to query intervals around low-velocity times, resulting in a 45% error. It seems to misinterpret increasing velocity estimates from finer time resolution as evidence of higher true velocities, rather than as improved measurement accuracy. Bottom: True velocity curves (gray) overlaid with the agent's observations (colored dots). Later queries appear in brighter hues, showing how an intelligently planned approach (left) can converge near the correct velocity, while a misplanned approach (right) fails to capture the velocity peak.
  • Figure 5: Finding task-specific thresholds based on the performance of expert solutions without planning (expert-ref-100). Each black dot represents the absolute percent difference (relative to a solution with access to the full simulated data) for simulation-task pairs grouped by task, using 100 uniformly spaced observations. Scenarios yielding large differences (e.g., 30-100% or more) show that naive uniform sampling is inadequate as a planning strategy. Red horizontal lines mark the final success threshold chosen for each task, which is applied to every simulation paired with that task.
  • ...and 2 more figures
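
The captions above describe a planning-free baseline that samples observations uniformly in time and estimates velocity from finite differences of position. A minimal sketch of that idea follows. It uses a toy Keplerian ellipse rather than the benchmark's simulations, and the function names, the eccentricity of 0.5, and the unit semi-major axis and period are all assumptions made for illustration; this is not the benchmark's observe tool or its reference solution.

```python
import math


def solve_kepler(mean_anomaly, e):
    """Solve M = E - e*sin(E) for E by bisection on [0, 2*pi].

    The left-hand side is monotonic in E, so bisection always converges.
    """
    lo, hi = 0.0, 2.0 * math.pi
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if mid - e * math.sin(mid) - mean_anomaly < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def position(t, a=1.0, e=0.5, period=1.0):
    """Position on a toy Kepler ellipse (assumed orbit, periapsis at t = 0)."""
    mean_anomaly = 2.0 * math.pi * ((t / period) % 1.0)
    ecc_anomaly = solve_kepler(mean_anomaly, e)
    x = a * (math.cos(ecc_anomaly) - e)
    y = a * math.sqrt(1.0 - e * e) * math.sin(ecc_anomaly)
    return x, y


def max_speed_uniform(n_obs, a=1.0, e=0.5, period=1.0):
    """Estimate the peak speed from n_obs uniformly spaced position samples.

    Speed is approximated by the chord length between consecutive
    observations divided by the time step (a forward finite difference).
    """
    dt = period / n_obs
    pts = [position(i * dt, a, e, period) for i in range(n_obs)]
    speeds = [
        math.hypot(x1 - x0, y1 - y0) / dt
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    ]
    return max(speeds)


def true_max_speed(a=1.0, e=0.5, period=1.0):
    """Analytic periapsis speed of a Kepler ellipse."""
    return 2.0 * math.pi * a / period * math.sqrt((1.0 + e) / (1.0 - e))


def percent_error(n_obs, a=1.0, e=0.5, period=1.0):
    """Percent error of the uniform-sampling estimate against the true peak."""
    truth = true_max_speed(a, e, period)
    return 100.0 * abs(max_speed_uniform(n_obs, a, e, period) - truth) / truth


if __name__ == "__main__":
    for budget in (10, 30, 100):
        print(budget, round(percent_error(budget), 2))
```

Because a chord never exceeds the arc it spans, this estimate always sits at or below the true peak speed, and it improves as the sampling gets finer. On highly eccentric orbits the peak occupies a small fraction of the period, which is why a planned, non-uniform allocation of the same budget can do much better than uniform sampling.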