SimLM: Can Language Models Infer Parameters of Physical Systems?

Sean Memery; Mirella Lapata; Kartic Subr

TL;DR

The paper investigates whether Large Language Models can infer physical system parameters from observations, focusing on an inverse-physics task: estimate $(h,v)$ so that the third bounce lands near a target. It introduces SimLM, a simulator-augmented prompting approach that interleaves physics simulation feedback with reasoning and self-critique, iterating up to $N=5$ times and leveraging past successful exemplars. In 2D experiments, SimLM improves over baseline CoT, with larger gains on harder, uneven terrains and relative error dropping below $1$ in many cases. In a higher-dimensional 3D billiards task, however, all models show limited success and only marginal gains from simulation. The work demonstrates that grounding LLMs with physical simulators can enhance physics reasoning in 2D, while highlighting current limits in complex 3D scenarios: simulator-grounded reasoning is a promising direction, but high-dimensional parameter inference needs further advances.
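The simulate-then-revise loop described above can be sketched on the flat-ground case. This is a minimal illustration, not the paper's implementation: the closed-form forward model, the restitution coefficient `0.8`, the starting guess, and the `propose_update` rule (a hypothetical stand-in for the LLM's revision step) are all assumptions. It also assumes the ball is released horizontally, the first ground contact counts as the first bounce, and there is no drag.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def third_bounce_x(h, v, restitution=0.8):
    """Horizontal position of the third ground contact on flat ground.

    Assumes a horizontal throw from height h with speed v, no drag, and a
    constant (assumed) coefficient of restitution on the vertical velocity.
    """
    t_fall = math.sqrt(2.0 * h / G)  # time until the first contact
    # Each later flight lasts 2 * e^k * t_fall, so three contacts take
    # t_fall * (1 + 2e + 2e^2) in total.
    total_time = t_fall * (1.0 + 2.0 * restitution + 2.0 * restitution ** 2)
    return v * total_time


def propose_update(h, v, landed_x, target):
    """Hypothetical stand-in for the LLM: rescale v using simulator feedback."""
    return h, v * target / landed_x


def infer_parameters(target=50.0, tolerance=1.0, max_iters=5, h=2.0, v=10.0):
    """Iterate guess -> simulate -> revise for at most max_iters rounds."""
    history = []
    for _ in range(max_iters):
        landed_x = third_bounce_x(h, v)
        history.append((h, v, landed_x))  # feedback that would enter the prompt
        if abs(landed_x - target) <= tolerance:
            break
        h, v = propose_update(h, v, landed_x, target)
    return h, v, history


if __name__ == "__main__":
    h, v, history = infer_parameters()
    for step, (hi, vi, xi) in enumerate(history, start=1):
        print(f"iteration {step}: h={hi:.2f} m, v={vi:.2f} m/s, lands at {xi:.2f} m")
```

On flat ground the landing position is linear in $v$, so this stand-in converges after a single revision. On uneven terrain no such closed form is available, which is where simulator feedback would be expected to matter most.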

Abstract

Several machine learning methods aim to learn or reason about complex physical systems. A common first step towards reasoning is to infer system parameters from observations of its behavior. In this paper, we investigate how well Large Language Models (LLMs) perform parameter inference in the context of physical systems. Our experiments suggest that they are not inherently suited to this task, even for simple systems. We propose a promising direction of exploration, which involves the use of physical simulators to augment the context of LLMs. We assess and compare the performance of different LLMs on a simple example with and without access to physical simulation.

Paper Structure (25 sections, 7 equations, 5 figures, 2 tables, 1 algorithm)

Introduction
Background
Previous Work
Physical System
Prompting Methods Notation
Retrieval-Augmented Generation
Self-Critique
Few-shot Chain-of-Thought
SimLM
Method
Implementation
Physics Simulation
LLM Inference
Experiments
Flat Surface
...and 10 more sections

Figures (5)

Figure 1: We posed a number of Large Language Models a simple query $Q$: "With what horizontal velocity $v$ and from what height $h$ should a ball be thrown so that its third bounce is within $1$ m of $50$ m?" Most models are incapable of answering this query. The figure shows the mean errors when the ground is flat (A) as well as sinusoidal (B). Surprisingly, providing more examples worsens the performance when the forward problem is difficult (sinusoidal case). The central proposition in this article is that augmentation of queries/prompts via a physics simulator enhances the ability of LLMs to reason about physics for difficult problems.
Figure 2: Visualizations capturing the difficulty of problems as a function of the ground surface (top row) and mean error across trials (bottom row). Errors (heat map colors) are shown across the parameter space spanned by horizontal velocity (X-axis) and initial height (Y-axis), measured as the mean distance away from the target (50 m).
Figure 3: A plot of the error ratio of our method vs. the baseline for surfaces of varying difficulty. The error ratio is less than one, showing that SimLM outperforms the baseline. As the problem becomes more difficult, SimLM is relatively more effective.
Figure 4: The use of physics simulation data helps LLMs (GPT and PaLM) cope with increasing difficulty. Plots of error in choosing initial conditions for a ball to bounce at a target distance on a flat surface (experiment A) and on an uneven surface (experiment B).
Figure 5: An LLM taking a shot using the PoolTool simulator.
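
The captions above do not define the error ratio of Figure 3. One plausible reading, stated here as an assumption, combines it with the mean-distance error of Figure 2. The symbols $T$ (number of trials), $x_3^{(t)}$ (third-bounce position in trial $t$), and $x^{*}$ (the target, 50 m) are introduced only for illustration:

$$
\bar{e} = \frac{1}{T} \sum_{t=1}^{T} \left| x_3^{(t)} - x^{*} \right|,
\qquad
r = \frac{\bar{e}_{\mathrm{SimLM}}}{\bar{e}_{\mathrm{baseline}}}
$$

Under this reading, $r < 1$ means SimLM's mean distance from the target is smaller than the baseline's, and a ratio that shrinks on harder surfaces matches the caption's claim that SimLM is relatively more effective there.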
