Table of Contents
Fetching ...

Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4

Yuxin Li, Minghao Liu, Ruida Wang, Wenzhao Ji, Zhitao He, Rui Pan, Junming Huang, Tong Zhang, Yi R. Fung

TL;DR

Lean4PHYS introduces a comprehensive framework for formal physics reasoning in Lean4 by building PhysLib, a community-driven unit-system and theorems library, and LeanPhysBench, the first Lean4-based physics benchmark containing $200$ hand-crafted statements derived from textbooks and Olympiads. The framework enables a bottom-up, modular formalization of physics across six topics and provides a structured NL-to-Lean4 formalization pipeline, including formatting alignment and verification. Across eight models and two context modes, results show suboptimal performance overall, with large general models reaching up to $40.50\%$ accuracy and PhysLib-included contexts yielding an $11.75\%$ average improvement, while expert Lean provers struggle to transfer mathematical reasoning to physics. The work demonstrates the feasibility and value of extending formal reasoning beyond mathematics into physics, and it provides open-source resources to catalyze future research in verifiable physics reasoning and broader scientific domains.

Abstract

We present **Lean4PHYS**, a comprehensive reasoning framework for college-level physics problems in Lean4. **Lean4PHYS** includes *LeanPhysBench*, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce *PhysLib*, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in **Lean4PHYS**, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our *PhysLib* can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our *LeanPhysBench* and the effectiveness of *PhysLib*. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.

Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4

TL;DR

Lean4PHYS introduces a comprehensive framework for formal physics reasoning in Lean4 by building PhysLib, a community-driven unit-system and theorems library, and LeanPhysBench, the first Lean4-based physics benchmark containing hand-crafted statements derived from textbooks and Olympiads. The framework enables a bottom-up, modular formalization of physics across six topics and provides a structured NL-to-Lean4 formalization pipeline, including formatting alignment and verification. Across eight models and two context modes, results show suboptimal performance overall, with large general models reaching up to accuracy and PhysLib-included contexts yielding an average improvement, while expert Lean provers struggle to transfer mathematical reasoning to physics. The work demonstrates the feasibility and value of extending formal reasoning beyond mathematics into physics, and it provides open-source resources to catalyze future research in verifiable physics reasoning and broader scientific domains.

Abstract

We present **Lean4PHYS**, a comprehensive reasoning framework for college-level physics problems in Lean4. **Lean4PHYS** includes *LeanPhysBench*, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce *PhysLib*, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in **Lean4PHYS**, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our *PhysLib* can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our *LeanPhysBench* and the effectiveness of *PhysLib*. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.

Paper Structure

This paper contains 36 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Overview of the Lean4PHYS framework: (a) Benchmark and Library Construction: The benchmark and library are developed using a bottom-up principle. We first establish the foundational units and SI system of PhysLib, then compose the calculation support, and finally construct the field-related theorems. For the benchmark, NL questions are transformed from a question-answering format to proof problems and subsequently formalized into Lean4 statements. (b) LLM Performance Evaluation: We evaluate the formal physics reasoning capabilities of major open and closed-source LLMs on LeanPhysBench, both with and without PhysLib. After inference is completed, Lean4 auto-verification evaluates model performance and highlights the utility of PhysLib.
  • Figure 2: Two examples from LeanPhysBench demonstrating different difficulty levels.
  • Figure 3: Statistics of LeanPhysBench: The distribution of 200 Lean4 physics statements across difficulty levels (on the left) and topics (on the right).
  • Figure 4: Three sampled physics questions from College Textbook, Olympics-Easy, Olympics-Hard problems. Each example shows the natural language problem statement followed by its corresponding Lean formalization with a verified proof.
  • Figure 5: Two examples from LeanPhysBench demonstrating different proofs of the same college-level problem, generated by Gemini-2.5-pro with and without PhysLib.
  • ...and 3 more figures