PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning

Wanjia Zhao, Qinwei Ma, Jingzhe Shi, Shirley Wu, Jiaqi Han, Yijia Xiao, Si-Yuan Chen, Xiao Luo, Ludwig Schmidt, James Zou

TL;DR

PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities, by combining structural rigor, theoretical guarantees, and symbolic validation.

Abstract

Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively underexplored. Most existing physics benchmarks evaluate only final answers, which fail to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combined with a fully rule-based method that we developed for symbolic formula equivalence matching, this ensures consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework aligns more closely with human experts' scoring than existing approaches. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for later training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
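
To make the DAG-of-formulas idea concrete, here is a minimal sketch of one plausible reading of ancestor-closure scoring: when a formula in the reference DAG is matched, it and all of its prerequisite formulas are credited, and the step-level score is the credited fraction of the DAG. The graph, the node names F1..F5, and the function names are hypothetical illustrations, not taken from the paper, and the paper's exact definitions may differ.

```python
from collections import deque


def ancestor_closure(parents, matched):
    """Return the matched nodes plus every node they transitively depend on.

    `parents` maps each formula to the formulas it is derived from
    (an assumed encoding of the reference-solution DAG).
    """
    closure = set()
    queue = deque(matched)
    while queue:
        node = queue.popleft()
        if node in closure:
            continue
        closure.add(node)
        queue.extend(parents.get(node, ()))
    return closure


def step_score(parents, matched):
    """Fraction of reference formulas credited under ancestor closure."""
    nodes = set(parents)
    for deps in parents.values():
        nodes.update(deps)
    if not nodes:
        return 0.0
    # Ignore matched labels that are not part of the reference DAG.
    credited = ancestor_closure(parents, matched) & nodes
    return len(credited) / len(nodes)


# Hypothetical reference solution: F1 and F2 give F3; F3 and F4 give F5.
parents = {
    "F1": [],
    "F2": [],
    "F3": ["F1", "F2"],
    "F4": [],
    "F5": ["F3", "F4"],
}

print(step_score(parents, {"F3"}))  # F3 plus its ancestors F1, F2 are credited
print(step_score(parents, {"F5"}))  # the final formula credits the whole DAG
```

Under this reading, a response that states only F3 in a form matching the reference still earns credit for F1 and F2, which is how a DAG avoids the strict step order that a linear rubric would impose.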

Paper Structure

This paper contains 57 sections, 2 theorems, 152 equations, 8 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Fix $\mathcal{F}=\{F_1,\dots,F_N\}$ with the index order $1<\cdots<N$. Let $\mathsf{Just}$ be the class of justification kernels $\vdash\subseteq\mathcal{F}\times\mathcal{F}$ that satisfy the singleton and causality assumptions (so $F_i\vdash F_j \Rightarrow i<j$). Let $\mathsf{DAG}$ [...] Then: [...] Consequently, $\Phi$ and $\Psi$ are mutual inverses and yield a bijection $\mathsf{Just}$ [...]

Figures (8)

  • Figure 1: Model performance on PRISM-Physics. We report Final-Answer Accuracy, Step-level Accuracy, and Response Time.
  • Figure 2: Left: Statistics of PRISM-Physics hierarchical topics and difficulty level. Right: A data example with the proposed DAG structure.
  • Figure 3: Step-level and final-answer accuracy across Physics Domain Categories and Difficulty Levels.
  • Figure 4: Comparison of accuracy and response time across reasoning levels.
  • Figure 5: Distribution of primary error types across models.
  • ...and 3 more figures

Theorems & Definitions (11)

  • Definition 1: Ancestor Closure
  • Definition 2: Ancestor Closure Scoring Policy
  • Definition 3: Justification System
  • Remark
  • Theorem 1: Bijection between order-keeping justifications and DAGs
  • Remark
  • Definition 4: Admissible Scoring Policy
  • Theorem 2: Exact Characterization of Scored Formulas
  • Remark
  • proof
  • ...and 1 more