PhyX: Does Your Model Have the "Wits" for Physical Reasoning?

Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong

TL;DR

PhyX tackles a gap in AI evaluation by proposing the first large-scale multimodal benchmark focused on physics-grounded reasoning, integrating visual perception, symbolic reasoning, and real-world constraints across six physics domains and six reasoning types. The dataset comprises 3,000 visually grounded questions with expert validation and a unified evaluation protocol, enabling robust assessment of MLLMs and LLMs through both open-ended and multiple-choice formats. Empirical results show a substantial gap between human expert performance and state-of-the-art models, with errors arising from visual misinterpretation, incomplete knowledge, and calculation mistakes, underscoring the need for deeper physical understanding and better visual grounding. The work provides a reproducible evaluation framework and granular analyses to guide future model development toward truly physically grounded AI systems with practical impact in education and scientific exploration.

Abstract

Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, leaving performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely used toolkits such as VLMEvalKit, enabling one-click evaluation. More details are available on our project page: https://phyx-bench.github.io/.
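
The abstract quotes three model accuracies and a gap to human experts but does not state the human score itself. The sketch below works out what the quoted numbers imply. It assumes that "exceeding 29%" means more than 29 percentage points (an interpretation, not something the abstract states), and the function names are illustrative only.

```python
# Accuracies (%) quoted in the abstract.
reported_accuracy = {
    "GPT-4o": 32.5,
    "Claude3.7-Sonnet": 42.2,
    "GPT-o4-mini": 45.8,
}

# Assumption: "exceeding 29%" is read as more than 29 percentage points.
MIN_GAP_POINTS = 29.0


def implied_human_floor(accuracies, min_gap):
    """Lowest human accuracy for which every model trails by at least min_gap."""
    best_model = max(accuracies.values())
    return best_model + min_gap


def gaps_to(human_accuracy, accuracies):
    """Per-model gap, in percentage points, to a given human accuracy."""
    return {
        name: round(human_accuracy - acc, 1)
        for name, acc in accuracies.items()
    }


floor = implied_human_floor(reported_accuracy, MIN_GAP_POINTS)
print(f"Implied human-expert accuracy: above {floor:.1f}%")
for name, gap in gaps_to(floor, reported_accuracy).items():
    print(f"{name}: trails by at least {gap} points")
```

Under that reading, the strongest quoted model sets the bound: human experts would need to score above roughly 74.8% for all three gaps to exceed 29 points.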

Paper Structure

This paper contains 36 sections, 51 figures, 5 tables.

Figures (51)

  • Figure 1: Accuracies of three leading MLLMs, two leading LLMs, and human experts on our proposed PhyX across 6 physical reasoning types and 6 domains.
  • Figure 1: A sample correct case of Mechanics.
  • Figure 2: Sampled PhyX examples from each domain.
  • Figure 2: A sample correct case of Mechanics.
  • Figure 3: Comparison with existing physics benchmarks. Realistic refers to the extent to which the dataset contains visually realistic physical scenarios. Size indicates the number of physics questions with images in multimodal benchmarks, or the total number of physics questions in text-only benchmarks. For evaluation methods, R: rule-based, M: model-based. For question type, OE: open-ended, MC: multiple-choice, FB: fill-in-the-blank, J: judgement. Upon comparison, PhyX leads in all aspects.
  • ...and 46 more figures