PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems

Oshayer Siddique, J. M Areeb Uzair Alam, Md Jobayer Rahman Rafy, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

TL;DR

This work studies inference-time strategies for improving how LLMs solve physics problems. It introduces PhysicsEval, a 19,609-problem benchmark, and evaluates four approaches (baseline, self-refinement, single-agent verification, and a multi-agent review framework) using a rubric-based scoring scheme and the Physics Proficiency Score (PPS). The multi-agent framework, which combines a proposer, verifiers, and a meta-verifier, yields notable gains, especially on harder problems, though the gains vary by domain and come with higher computational cost. The study provides a foundation for adaptive, category-aware reasoning enhancements and releases the dataset and code to promote reproducibility and further research, with PPS offering a concrete, interpretable measure of physics reasoning proficiency.

Abstract

The discipline of physics stands as a cornerstone of human intellect, driving the evolution of technology and deepening our understanding of the fundamental principles of the cosmos. Contemporary literature includes some works centered on the task of solving physics problems, a crucial domain of natural language reasoning. In this paper, we evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We also employ a plethora of inference-time techniques and agentic frameworks to improve the performance of the models. This includes the verification of proposed solutions in a cumulative fashion by other, smaller LLM agents, and we perform a comparative analysis of the performance that the techniques entail. There are significant improvements when the multi-agent framework is applied to problems that the models initially perform poorly on. Furthermore, we introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks and their corresponding correct solutions scraped from physics forums and educational websites. Our code and data are publicly available at https://github.com/areebuzair/PhysicsEval.

Paper Structure

This paper contains 37 sections, 5 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Example of an astrophysics problem from the PhysicsEval benchmark.
  • Figure 2: An overview of the multi-agent review model. The model names are, of course, subject to shuffling.
  • Figure 3: Category-specific impact of the Multi-Agent Review framework across all scoring rubrics for o4-mini.
  • Figure 4: Multi-Agent PPS by Model and Physics Category.