Exploring the Limits of Fine-grained LLM-based Physics Inference via Premise Removal Interventions

Jordan Meadows, Tamsin James, Andre Freitas

TL;DR

The models' mathematical reasoning is not physics-informed in this setting: physical context is predominantly ignored in favour of reverse-engineering solutions.

Abstract

Language models (LMs) can hallucinate when performing complex mathematical reasoning. Physics provides a rich domain for assessing their mathematical capabilities, where physical context requires that any symbolic manipulation satisfies complex semantics (e.g., units, tensorial order). In this work, we systematically remove crucial context from prompts to force instances where model inference may be algebraically coherent, yet unphysical. We assess LM capabilities in this domain using a curated dataset encompassing multiple notations and Physics subdomains. Further, we improve zero-shot scores using synthetic in-context examples, and demonstrate non-linear degradation of derivation quality with perturbation strength via the progressive omission of supporting premises. We find that the models' mathematical reasoning is not physics-informed in this setting, where physical context is predominantly ignored in favour of reverse-engineering solutions.
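As a rough illustration of the premise removal intervention described in the abstract, the following minimal Python sketch enumerates prompt variants with a growing number of premises omitted. It is not the authors' implementation: the function names (`remove_premises`, `build_prompt`), the prompt layout, and the toy Newtonian derivation are all assumptions made for illustration.

```python
from itertools import combinations


def remove_premises(premises, k):
    """Yield every variant of `premises` with exactly k premises omitted."""
    for dropped in combinations(range(len(premises)), k):
        yield [p for i, p in enumerate(premises) if i not in dropped]


def build_prompt(premises, goal):
    """Format a derivation prompt (hypothetical layout, not the paper's)."""
    lines = ["Premises:"] + [f"  {p}" for p in premises]
    lines.append(f"Derive: {goal}")
    return "\n".join(lines)


# Hypothetical toy derivation (constant mass assumed), not from the dataset.
premises = [r"F = m a", r"a = \frac{dv}{dt}", r"p = m v"]
goal = r"F = \frac{dp}{dt}"

# Perturbation strength k = number of supporting premises omitted.
variants_by_strength = {
    k: list(remove_premises(premises, k)) for k in range(len(premises) + 1)
}

for k, variants in variants_by_strength.items():
    print(f"k={k}: {len(variants)} prompt variant(s)")
```

Each variant can then be passed through `build_prompt` and scored against the reference derivation, so that derivation quality can be compared across values of `k`.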

Paper Structure

This paper contains 39 sections, 79 equations, 5 figures, 6 tables, and 1 algorithm.

Figures (5)

  • Figure 1: An incorrect derivation generated by few-shot GPT-4 that scores highly on ROUGE (81), BLEU (71), and GLEU (71). Erroneous equations are denoted in red.
  • Figure 2: The difference between the Wikipedia proof (left) and our equational interpretation (right) of a reasoning chain related to the Uncertainty Principle in quantum mechanics. The values in red give the number of intermediate equations between equivalent equations in each representation, and highlight the detail gap.
  • Figure 3: $P(L)$ is the probability that a given derivation contains $L$ equations.
  • Figure 4: $P(N)$ is the probability that a given derivation contains $N$ premise equations.
  • Figure 5: Excerpts from few-shot GPT-4 derivations that violate well-documented Physics.
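
The distributions $P(L)$ and $P(N)$ in Figures 3 and 4 can be read as empirical frequencies over the dataset. A minimal sketch of that calculation follows, assuming each derivation is summarised by a single count; the helper name `empirical_distribution` and the sample counts are hypothetical and are not taken from the paper.

```python
from collections import Counter


def empirical_distribution(counts):
    """Map each observed count to the fraction of derivations that have it."""
    total = len(counts)
    freq = Counter(counts)
    return {value: n / total for value, n in sorted(freq.items())}


# Hypothetical number of equations L in each of eight derivations.
equations_per_derivation = [4, 6, 6, 8, 6, 4, 10, 8]

p_L = empirical_distribution(equations_per_derivation)
for length, prob in p_L.items():
    print(f"P(L={length}) = {prob:.3f}")
```

The same helper applies unchanged to the number of premise equations $N$ per derivation.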