SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning

Kun Xiang; Heng Li; Terry Jingchen Zhang; Yinya Huang; Zirong Liu; Peixin Qu; Jixi He; Jiaqi Chen; Yu-Jie Yuan; Jianhua Han; Hang Xu; Hanhui Li; Mrinmaya Sachan; Xiaodan Liang

TL;DR

SeePhys is a fully multimodal benchmark for physics reasoning, spanning 7 domains and 21 diagram types across 8 knowledge levels, designed to test how well models couple visual diagram interpretation with physics deduction. In evaluations of 28 LLMs and MLLMs, even frontier models struggle to exceed 55% accuracy, indicating persistent gaps in visual perception and diagram-based reasoning. The dataset separates Vision-Essential from Vision-Optional problems and includes a purely visual variant that isolates image-based reasoning, which allows detailed analysis of visual dependency and error modes. The data, evaluation pipelines, and failure-mode analyses are open-sourced to support further work on visual physics understanding and multimodal world modeling.

Abstract

We present SeePhys, a large-scale multimodal benchmark for LLM reasoning grounded in physics questions ranging from middle school to PhD qualifying exams. The benchmark covers 7 fundamental domains spanning the physics discipline and incorporates 21 categories of highly heterogeneous diagrams. In contrast to prior work, where visual elements mainly serve auxiliary purposes, our benchmark features a substantial proportion of vision-essential problems (75%) that cannot be solved correctly without extracting information from the diagram. Through extensive evaluation, we observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark. These results reveal fundamental challenges in current large language models' visual understanding capabilities, particularly in: (i) establishing rigorous coupling between diagram interpretation and physics reasoning, and (ii) overcoming their persistent reliance on textual cues as cognitive shortcuts.
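
The split between vision-essential and vision-optional problems suggests reporting accuracy per subset rather than as a single number. The following is a minimal sketch of that kind of breakdown, not the SeePhys evaluation pipeline: the record layout, the field names `vision` and `correct`, and the toy records are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: fraction of correct answers}, grouping records by records[i][key]."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical toy records; these are NOT SeePhys data or results.
records = [
    {"vision": "essential", "correct": True},
    {"vision": "essential", "correct": False},
    {"vision": "essential", "correct": False},
    {"vision": "optional", "correct": True},
]

by_vision = accuracy_by_group(records, "vision")
for group, acc in sorted(by_vision.items()):
    print(f"{group}: {acc:.1%}")
```

The same helper could group by any other per-problem label, such as domain, diagram type, or knowledge level, provided each record carries that field.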
