Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents
Nolan Koblischke, Hyunseok Jang, Kristen Menou, Mohamad Ali-Dib
TL;DR
Gravity-Bench-v1 is a partially observable benchmark, built on gravitational dynamics simulations, that evaluates AI agents' capacity for scientific discovery in dynamic environments. Agents must plan data collection under a fixed observation budget and reason over accumulating observations to infer hidden quantities and orbital properties, including in out-of-distribution dynamics such as drag and modified gravity. The benchmark spans 16 two-body simulations, 50 tasks, and 206 task-simulation pairs, with expert baselines and an open-ended solution space that encourages novel planning and reasoning strategies. Some models perform well when given access to the full data, but planning observations under a constrained budget remains a major bottleneck, which highlights the need for better long-horizon reasoning and adaptive experimentation. Possible extensions include richer physics and reinforcement-learning-enabled agents.
Abstract
Modern science emerged from reasoning over repeatedly observed planetary motions. We present Gravity-Bench-v1, an environment-based benchmark that challenges AI agents on tasks that parallel this historical development. Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. It includes out-of-distribution cases, i.e., cases whose physics deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan data collection within an experimental budget and must analyze and reason over the data dynamically, as it accumulates, to solve tasks efficiently. Our benchmark admits an open-ended space of solutions. Reference solutions for each task are provided to calibrate AI performance against human expertise. Although the tasks are technically at an upper-undergraduate level, the benchmark proves challenging for baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities.
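
To make the idea of budgeted observation concrete, the following is a minimal sketch of the kind of task described above. It is not the benchmark's actual interface: the class `BudgetedOrbit`, the function `estimate_total_mass`, and all parameters are hypothetical names introduced here for illustration. The sketch assumes a circular, counterclockwise two-body orbit and two observations spaced less than one period apart, and it recovers the total mass from Kepler's third law, M = ω² a³ / G.

```python
import math

G = 6.674e-11  # gravitational constant, SI units


class BudgetedOrbit:
    """Hypothetical stand-in for a simulation: a circular two-body orbit
    that can only be observed a limited number of times."""

    def __init__(self, m1, m2, separation, budget):
        self.m1, self.m2 = m1, m2
        self.a = separation
        self.remaining = budget
        # Kepler's third law gives the angular velocity of the relative orbit.
        self.omega = math.sqrt(G * (m1 + m2) / separation**3)

    def observe(self, t):
        """Return both bodies' positions at time t, spending one observation."""
        if self.remaining <= 0:
            raise RuntimeError("observation budget exhausted")
        self.remaining -= 1
        phase = self.omega * t
        # Distances of each body from the barycenter.
        r1 = self.a * self.m2 / (self.m1 + self.m2)
        r2 = self.a * self.m1 / (self.m1 + self.m2)
        p1 = (r1 * math.cos(phase), r1 * math.sin(phase))
        p2 = (-r2 * math.cos(phase), -r2 * math.sin(phase))
        return p1, p2


def estimate_total_mass(env, t0, dt):
    """Infer m1 + m2 from two observations spaced dt apart.

    Assumes a circular, counterclockwise orbit and dt shorter than one period.
    """
    angles, seps = [], []
    for t in (t0, t0 + dt):
        (x1, y1), (x2, y2) = env.observe(t)
        dx, dy = x1 - x2, y1 - y2
        angles.append(math.atan2(dy, dx))   # orientation of the relative vector
        seps.append(math.hypot(dx, dy))     # separation between the bodies
    dtheta = (angles[1] - angles[0]) % (2 * math.pi)
    omega = dtheta / dt
    a = sum(seps) / len(seps)
    return omega**2 * a**3 / G


# Usage: Sun-Earth-like masses and separation, two observations 30 days apart.
env = BudgetedOrbit(1.989e30, 5.972e24, 1.496e11, budget=2)
mass = estimate_total_mass(env, t0=0.0, dt=30 * 86400.0)
```

In this idealized setting two observations suffice. The planning problem becomes harder once orbits are eccentric or the dynamics deviate from Newtonian gravity, because the agent must then decide where along the orbit to spend its remaining observations.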
