Physics simulation capabilities of LLMs

Mohamad Ali-Dib, Kristen Menou

TL;DR

The paper assesses whether state-of-the-art LLMs can meaningfully contribute to graduate- and PhD-level computational physics by generating and validating code that uses open-source simulation tools. It introduces a four-class physics task complexity framework and a ~50-problem benchmark spanning celestial mechanics, stellar physics, 1D fluid dynamics, and nonlinear dynamics, implemented with REBOUND, MESA, Dedalus, and SciPy, respectively. Zero-shot, GPT-4 produces no fully autonomous graduate-level solution, but about 40% of its solutions could plausibly earn a passing grade, and 70–90% of the generated code lines are necessary, sufficient, and correct. The work identifies practical failure modes (unit handling, version drift, module hallucinations, and gaps in physical modeling) and points to concrete targets for improving AI-assisted physics simulation. Overall, the study offers a snapshot of current capabilities and a basis for evaluating and steering AI systems toward reliable, simulation-based reasoning in physics.

Abstract

[Abridged abstract] Large Language Models (LLMs) can solve some undergraduate-level to graduate-level physics textbook problems and are proficient at coding. Combining these two capabilities could one day enable AI systems to simulate and predict the physical world. We present an evaluation of state-of-the-art (SOTA) LLMs on PhD-level to research-level computational physics problems. We condition LLM generation on the use of well-documented and widely-used packages to elicit coding capabilities in the physics and astrophysics domains. We contribute $\sim 50$ original and challenging problems in celestial mechanics (with REBOUND), stellar physics (with MESA), 1D fluid dynamics (with Dedalus) and non-linear dynamics (with SciPy). Since our problems do not admit unique solutions, we evaluate LLM performance on several soft metrics: counts of lines that contain different types of errors (coding, physics, necessity and sufficiency) as well as a more "educational" Pass-Fail metric focused on capturing the salient physical ingredients of the problem at hand. As expected, today's SOTA LLM (GPT4) zero-shot fails most of our problems, although about $40\%$ of the solutions could plausibly get a passing grade. About $70$–$90\%$ of the code lines produced are necessary, sufficient and correct (coding and physics). Physics and coding errors are the most common, with some unnecessary or insufficient lines. We observe significant variations across problem class and difficulty. We identify several failure modes of GPT4 in the computational physics domain. Our reconnaissance work provides a snapshot of current computational capabilities in classical physics and points to obvious improvement targets if AI systems are ever to reach a basic level of autonomy in physics simulation capabilities.
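
To make the two kinds of soft metric concrete, the sketch below tallies per-line error labels for a single generated solution and computes a pass fraction over a set of grades. It is only an illustration under stated assumptions: the function names, the label names, the way insufficiency is recorded (as a count of missing lines), and the example annotations are all hypothetical and are not taken from the paper's grading procedure or data.

```python
from collections import Counter

# Hypothetical error labels a human grader might attach to a generated line.
ERROR_TYPES = ("coding", "physics", "unnecessary")


def score_solution(line_labels, missing_lines=0):
    """Summarize grader annotations for one generated solution.

    line_labels   -- list with one set of error labels per generated code line
                     (an empty set means the line is correct and needed)
    missing_lines -- assumed way of recording insufficiency: the number of
                     lines the grader judged to be missing
    """
    counts = Counter(label for labels in line_labels for label in labels)
    total = len(line_labels)
    clean = sum(1 for labels in line_labels if not labels)
    return {
        "total_lines": total,
        "clean_fraction": clean / total if total else 0.0,
        "error_counts": {t: counts.get(t, 0) for t in ERROR_TYPES},
        "missing_lines": missing_lines,
    }


def pass_fraction(grades):
    """Fraction of solutions graded Pass- or Pass+ (the coarser metric)."""
    if not grades:
        return 0.0
    return sum(g in ("Pass-", "Pass+") for g in grades) / len(grades)


# Illustrative annotations only, not results from the paper.
example_labels = [set(), set(), {"physics"}, set(), {"coding"},
                  set(), set(), {"unnecessary"}, set(), set()]
example_summary = score_solution(example_labels, missing_lines=1)
example_pass = pass_fraction(["Fail", "Pass-", "Fail", "Pass+", "Fail"])
```

The line-level summary and the Pass-Fail grade are kept separate here because they answer different questions: a solution can have mostly clean lines and still miss a key physical ingredient, or the reverse.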

Paper Structure (41 sections, 1 figure, 1 table)

Figures (1)

  • Figure 1: Histograms of GPT4 solution grades (Fail, Pass-, or Pass+), grouped by code base. About $40\%$ of the solutions receive a passing grade, with significant variations across code bases. This soft metric measures whether a solution addresses the key physics ingredients of each problem. All solutions contain physics and/or coding errors.