SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

Kun Xiang, Heng Li, Terry Jingchen Zhang, Yinya Huang, Zirong Liu, Peixin Qu, Jixi He, Jiaqi Chen, Yu-Jie Yuan, Jianhua Han, Hang Xu, Hanhui Li, Mrinmaya Sachan, Xiaodan Liang

TL;DR

SeePhys is a fully multimodal benchmark for physics reasoning, spanning 7 domains and 21 diagram types across 8 knowledge levels, designed to test how well models couple visual diagram interpretation with physics deduction. Across evaluations of 28 LLMs/MLLMs, even frontier models struggle to exceed 55% accuracy, indicating persistent gaps in visual perception and diagram-based reasoning. The dataset separates Vision-Essential and Vision-Optional problems and includes a purely visual variant to isolate image-based reasoning, enabling detailed analysis of visual dependency and error modes. By open-sourcing the data, evaluation pipelines, and an analysis of failure modes, SeePhys provides a platform for advancing visual physics understanding and multimodal world modeling in AI systems.

Abstract

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. In contrast to prior works where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that mandate visual information extraction for correct solutions. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
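
The split between vision-essential and vision-optional problems suggests reporting accuracy per subset rather than as a single number. A minimal sketch of that bookkeeping follows. The record layout, the field names `vision_essential` and `correct`, and the four toy rows are assumptions made for illustration; they are not the SeePhys data schema, its evaluation pipeline, or any reported result.

```python
from collections import defaultdict

# Hypothetical graded records: field names and values are illustrative only,
# not the SeePhys schema or any result from the paper.
records = [
    {"id": "q1", "vision_essential": True, "correct": False},
    {"id": "q2", "vision_essential": True, "correct": True},
    {"id": "q3", "vision_essential": False, "correct": True},
    {"id": "q4", "vision_essential": True, "correct": False},
]


def accuracy_by_subset(rows):
    """Return {subset_name: accuracy} for the two question categories."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for row in rows:
        key = "vision_essential" if row["vision_essential"] else "vision_optional"
        totals[key] += 1
        hits[key] += int(row["correct"])
    # Only subsets that actually occur appear in the output.
    return {key: hits[key] / totals[key] for key in totals}


result = accuracy_by_subset(records)
for subset, acc in sorted(result.items()):
    print(f"{subset}: {acc:.2f}")
```

Reporting the two subsets separately makes it visible when a model scores well mainly on problems whose text already describes the diagram.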

Paper Structure

This paper contains 36 sections, 34 figures, 6 tables.

Figures (34)

  • Figure 1: Overview of SeePhys. It encompasses 7 core physics domains and 21 diagram types, spanning the full knowledge spectrum from middle school to PhD qualifying exams.
  • Figure 2: Examples of Vision-Optional/Vision-Essential questions. In Vision-Optional samples, the text describes the visual content (e.g., graphical attributes and spatial relationships) in enough detail that the image serves only as an illustration. In Vision-Essential samples, the image contains information indispensable to solving the problem, such as numerical values for key variables and topological structures not specified in the text.
  • Figure 3: The sensitivity of models to different diagram types under TV/TC/TO/VO settings.
  • Figure 4: Examples of primary error patterns. Quantitative analyses are presented in Appendix E.
  • Figure 5: Statistics of our benchmark.
  • ...and 29 more figures