PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu

TL;DR

PhysReason introduces a large-scale, physics-focused reasoning benchmark with 1,200 problems designed to probe multi-step physics reasoning in LLMs, including a substantial multi-modal component. It pairs the dataset with the Physics Solution Auto Scoring Framework (PSAS), offering both answer-level and step-level evaluation to quantify not only final answers but the quality of intermediate reasoning. The results show that current systems struggle as reasoning depth increases, with step-level evaluation providing sharper discrimination than answer-level metrics and illuminating dominant error modes. By releasing the dataset, annotation protocol, and evaluation framework, PhysReason aims to drive improvements in physics-based reasoning for AI systems in robotics, autonomous systems, and scientific applications.

Abstract

Large language models demonstrate remarkable capabilities across various domains, especially mathematics and logical reasoning. However, current evaluations overlook physics-based reasoning, a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard problems requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, incorporating efficient answer-level and comprehensive step-level evaluations. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identified four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason.
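To make the distinction between answer-level and step-level evaluation concrete, the following is a minimal sketch and not the paper's scoring framework. The `Step` schema, the numeric tolerance, the position-by-position matching, and the toy incline problem are all assumptions introduced for illustration. It shows how a solution with mostly correct intermediate reasoning can score zero at the answer level while still earning partial credit at the step level.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One annotated solution step (hypothetical schema, not the PhysReason format)."""
    description: str
    expected: float


def is_close(predicted: float, expected: float, rel_tol: float = 1e-2) -> bool:
    """Treat a value as matching if it lies within a relative tolerance of the reference."""
    return abs(predicted - expected) <= rel_tol * max(abs(expected), 1e-12)


def answer_level_score(predicted_final: float, reference: list[Step]) -> float:
    """Return 1.0 if the final answer matches the last reference step, else 0.0."""
    return 1.0 if is_close(predicted_final, reference[-1].expected) else 0.0


def step_level_score(predicted_steps: list[float], reference: list[Step]) -> float:
    """Return the fraction of reference steps reproduced, compared position by position."""
    matched = sum(
        is_close(pred, ref.expected)
        for pred, ref in zip(predicted_steps, reference)
    )
    return matched / len(reference)


# Toy problem (assumed): a 2 kg block slides down a frictionless 5 m drop, g = 9.8 m/s^2.
reference = [
    Step("potential energy m*g*h in J", 98.0),
    Step("speed at the bottom sqrt(2*g*h) in m/s", 9.899),
    Step("momentum m*v in kg*m/s", 19.799),
]

# A hypothetical model output that forgets to multiply by the mass in the last step.
model_steps = [98.0, 9.9, 9.9]

print(answer_level_score(model_steps[-1], reference))          # 0.0
print(round(step_level_score(model_steps, reference), 2))      # 0.67
```

In this toy case the answer-level score alone would hide the fact that two of the three steps are correct, which is the kind of information a step-level metric is meant to expose.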

Paper Structure

This paper contains 47 sections, 1 equation, 10 figures, 7 tables.

Figures (10)

  • Figure 1: An illustrative example from our PhysReason benchmark. Due to space constraints, only key components are shown. Please refer to the appendix for complete annotations.
  • Figure 2: Analysis of solution theorems, solution steps, and solution tokens across different problem categories, with comparisons from SciBench, GPQA, and OlympiadBench.
  • Figure 3: Step-level evaluation example obtained from the PSAS-S framework.
  • Figure 4: Error statistics with the PSAS-S framework on PhysReason-mini, where Gemini-T-1206 and Gemini-T-0121 denote Gemini-2.0-Flash-Thinking-1206 and Gemini-2.0-Flash-Thinking-0121.
  • Figure 5: Performance with the PSAS-S framework on the hard problems from PhysReason-mini.
  • ...and 5 more figures