Table of Contents
Fetching ...

FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory

Xiao-Wen Yang, Zihao Zhang, Jianuo Cao, Zhi Zhou, Zenan Li, Lan-Zhe Guo, Yuan Yao, Taolue Chen, Yu-Feng Li, Xiaoxing Ma

TL;DR

FormalML tackles subgoal completion in formal theorem proving by introducing Lean 4–based Subgoal benchmarks derived from foundational machine learning theories. It uses a Lean tactic to extract line-level subgoals from procedural proofs, yielding 4,937 problems from Optlib and FoML that blend premise retrieval with rich research contexts. Empirical results reveal persistent gaps in accuracy, efficiency, and premise retrieval for current LLM-based provers, and show that expert iteration can significantly boost subgoal completion performance. By providing a dedicated, retrieval-aware benchmark, FormalML aims to catalyze the development of more capable theorem provers that can effectively assist mathematicians on research-level proofs.

Abstract

Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving. Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored. We identify this challenge as the task of subgoal completion, where an LLM must discharge short but nontrivial proof obligations left unresolved in a human-provided sketch. To study this problem, we introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning. Using a translation tactic that converts procedural proofs into declarative form, we extract 4937 problems spanning optimization and probability inequalities, with varying levels of difficulty. FormalML is the first subgoal completion benchmark to combine premise retrieval and complex research-level contexts. Evaluation of state-of-the-art provers highlights persistent limitations in accuracy and efficiency, underscoring the need for more capable LLM-based theorem provers for effective subgoal completion,

FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory

TL;DR

FormalML tackles subgoal completion in formal theorem proving by introducing Lean 4–based Subgoal benchmarks derived from foundational machine learning theories. It uses a Lean tactic to extract line-level subgoals from procedural proofs, yielding 4,937 problems from Optlib and FoML that blend premise retrieval with rich research contexts. Empirical results reveal persistent gaps in accuracy, efficiency, and premise retrieval for current LLM-based provers, and show that expert iteration can significantly boost subgoal completion performance. By providing a dedicated, retrieval-aware benchmark, FormalML aims to catalyze the development of more capable theorem provers that can effectively assist mathematicians on research-level proofs.

Abstract

Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving. Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored. We identify this challenge as the task of subgoal completion, where an LLM must discharge short but nontrivial proof obligations left unresolved in a human-provided sketch. To study this problem, we introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning. Using a translation tactic that converts procedural proofs into declarative form, we extract 4937 problems spanning optimization and probability inequalities, with varying levels of difficulty. FormalML is the first subgoal completion benchmark to combine premise retrieval and complex research-level contexts. Evaluation of state-of-the-art provers highlights persistent limitations in accuracy and efficiency, underscoring the need for more capable LLM-based theorem provers for effective subgoal completion,

Paper Structure

This paper contains 29 sections, 1 equation, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The informal and formal versions of the statement and proof regarding the sufficient decrease in gradient descent. Both the formal statement and proof can be accurately formalized from the natural language version. However, to verify each step in Lean 4, there are unproven holes (notated as ) that need to be further completed by existing automated theorem provers.
  • Figure 2: An example of the tactic illustrates its functionality. When applied to the tactic , it captures pre- and post-execution proof states, abstracts their transition, and then synthesizes a subgoal.
  • Figure 3: Statistics of FormalML theorems: (Left) distribution by proving difficulty; (Middle) retrieval sources and averages; (Right) average number of retrieved premises by difficulty.
  • Figure 4: The left figure presents results of pass rate across various specific problem domains, while the right figure shows the performances under different difficulty levels.
  • Figure 5: The left figure illustrates the efficiency comparison among current whole-proof generation provers, while the right shows performances before and after expert iteration.
  • ...and 2 more figures