Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4
Yuxin Li, Minghao Liu, Ruida Wang, Wenzhao Ji, Zhitao He, Rui Pan, Junming Huang, Tong Zhang, Yi R. Fung
TL;DR
Lean4PHYS introduces a comprehensive framework for formal physics reasoning in Lean4 by building PhysLib, a community-driven unit-system and theorems library, and LeanPhysBench, the first Lean4-based physics benchmark containing $200$ hand-crafted statements derived from textbooks and Olympiads. The framework enables a bottom-up, modular formalization of physics across six topics and provides a structured NL-to-Lean4 formalization pipeline, including formatting alignment and verification. Across eight models and two context modes, results show suboptimal performance overall, with large general models reaching up to $40.50\%$ accuracy and PhysLib-included contexts yielding an $11.75\%$ average improvement, while expert Lean provers struggle to transfer mathematical reasoning to physics. The work demonstrates the feasibility and value of extending formal reasoning beyond mathematics into physics, and it provides open-source resources to catalyze future research in verifiable physics reasoning and broader scientific domains.
Abstract
We present **Lean4PHYS**, a comprehensive reasoning framework for college-level physics problems in Lean4. **Lean4PHYS** includes *LeanPhysBench*, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce *PhysLib*, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in **Lean4PHYS**, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our *PhysLib* can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our *LeanPhysBench* and the effectiveness of *PhysLib*. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.
