PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Jiaming Ji, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li, Feiyu Tao, Qihua Sun, Zhou Liang, Yushu Mu, Zhongxuan Li, Jing-Jun Zhang, Shutao Zhang, Xiaotian Li, Xingqi Xia, Jiawei Lin, Zheyu Shen, Jiahang Chen, Qiuhao Xiong, Binran Wang, Fengyuan Wang, Ziyang Ni, Bohan Zhang, Fan Cui, Changkun Shao, Qing-Hong Cao, Ming-xing Luo, Yaodong Yang, Muhan Zhang, Hua Xing Zhu

TL;DR

PHYBench addresses key weaknesses of existing reasoning benchmarks with 500 original physics problems spanning multiple domains, designed to probe physical perception and robust multi-step reasoning in LLMs. It introduces the Expression Edit Distance (EED) Score, which evaluates symbolic answers with partial credit, and shows that human experts significantly outperform current models. PHYBench differentiates models more strongly than prior benchmarks, and EED yields substantial gains in evaluation efficiency. The paper also analyzes model errors using stage-wise localization and perturbation tests, highlighting gaps in semantic understanding and long-horizon reasoning.

Abstract

Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items. These deficiencies necessitate more rigorous assessment methods. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models compared to other baselines like AIME 2024, OlympiadBench and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9% accuracy compared to human experts' 61.9%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.
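
To make the contrast between binary scoring and edit-distance-based partial credit concrete, the following is a minimal sketch and not the paper's EED implementation. It substitutes a token-level Levenshtein distance for whatever expression-level distance EED actually uses, and the linear mapping from relative distance to a 0-100 score is an assumption chosen for illustration. The function names `tokenize`, `edit_distance`, `partial_credit`, and `binary_score` are hypothetical.

```python
import re


def tokenize(expr: str) -> list[str]:
    """Split an expression string into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", expr)


def edit_distance(a: list[str], b: list[str]) -> int:
    """Classic Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute x -> y
        prev = cur
    return prev[-1]


def partial_credit(answer: str, reference: str) -> float:
    """Assumed scoring rule: 100 * (1 - relative distance), floored at 0."""
    ref = tokenize(reference)
    if not ref:
        raise ValueError("reference expression is empty")
    relative = edit_distance(tokenize(answer), ref) / len(ref)
    return max(0.0, 100.0 * (1.0 - relative))


def binary_score(answer: str, reference: str) -> float:
    """All-or-nothing scoring: full marks only for an exact token match."""
    return 100.0 if tokenize(answer) == tokenize(reference) else 0.0
```

Under this sketch, an answer such as `m*g/3` against a reference of `2*m*g/3` receives a nonzero partial score, while binary scoring gives it zero. A token-level distance cannot recognize algebraically equivalent forms, which is one reason a real symbolic metric would need to work on parsed expressions.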

Paper Structure

This paper contains 51 sections, 64 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Model performance on PHYBench. We report accuracy and EED Score for both reasoning and general language models, averaged over all samples.
  • Figure 2: An example problem from PHYBench. Two evaluation metrics are employed: Expression Edit Distance (EED) Score and accuracy. We show the scores for three different responses, with Model Answer 1 and Model Answer 2 generated by DeepSeek-R1 and GPT-4o respectively.
  • Figure 3: Pipeline of PHYBench data curation.
  • Figure 4: Token Usage and Score of Typical Models on Different Benchmarks
  • Figure 5: TTS on PHYBench: comparison between pass@$k$ and majority voting strategies, both evaluated under varying numbers of sampled responses $k$ (log-scale on the x-axis).
  • ...and 7 more figures
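
The two test-time scaling strategies named in the Figure 5 caption can be sketched as follows. This is an illustrative per-problem check, assuming answers are compared by exact string equality and that the first k sampled responses are used; the paper's actual answer matching and sampling procedure are not given here. The function names `pass_at_k` and `majority_vote` are hypothetical.

```python
from collections import Counter


def pass_at_k(samples: list[str], reference: str, k: int) -> bool:
    """True if any of the first k sampled answers matches the reference."""
    if k < 1 or not samples:
        raise ValueError("need k >= 1 and at least one sample")
    return any(s == reference for s in samples[:k])


def majority_vote(samples: list[str], reference: str, k: int) -> bool:
    """True if the most frequent answer among the first k samples matches."""
    if k < 1 or not samples:
        raise ValueError("need k >= 1 and at least one sample")
    winner, _count = Counter(samples[:k]).most_common(1)[0]
    return winner == reference


# Hypothetical sampled answers for one problem whose reference answer is "a".
samples = ["a", "b", "b", "c"]
print(pass_at_k(samples, "a", 4), majority_vote(samples, "a", 4))
```

With these samples, pass@k succeeds because one response is correct, while majority voting fails because the most common answer is wrong. Averaging each boolean over all problems gives the per-strategy curve as k grows.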