Table of Contents
Fetching ...

miniCTX: Neural Theorem Proving with (Long-)Contexts

Jiewen Hu, Thomas Zhu, Sean Welleck

TL;DR

miniCTX introduces a context-rich benchmark for neural theorem proving in Lean 4, challenging models with the need to use long, real-world context from multiple files and projects. The authors deploy ntp-toolkit to automatically extract contexts and provide a Lean REPL-based evaluation pipeline, enabling both next-tactic and full-proof generation tasks. Across baselines including GPT-4o prompting and file-tuning, results show that incorporating preceding file content and broader context substantially improves proof success, while cross-file premise retrieval yields mixed gains and can interfere with in-file context. The work highlights that contemporary benchmarks often miss context dependence and data contamination issues, and shows that robust context-enabled approaches are essential for practical, real-world theorem proving. Overall, miniCTX provides a rigorous, updateable framework for evaluating context-aware provers and motivates future approaches to long-context integration and cross-file reasoning in formalization projects.

Abstract

Real-world formal theorem proving often depends on a wealth of context, including definitions, lemmas, comments, file structure, and other information. We introduce miniCTX, which tests a model's ability to prove formal mathematical theorems that depend on new context that is not seen during training. miniCTX contains theorems sourced from real Lean projects and textbooks, each associated with a context that can span tens of thousands of tokens. Models are tasked with proving a theorem given access to code from the theorem's repository, which contains context that is needed for the proof. As a baseline for miniCTX, we tested fine-tuning and prompting methods that condition theorem proving on preceding context. Both approaches substantially outperform traditional methods that rely solely on state information. We found that this ability to use context is not captured by previous benchmarks such as miniF2F. Alongside miniCTX, we offer ntp-toolkit for automatically extracting and annotating theorem proving data, making it easy to add new projects into miniCTX to ensure that contexts are not seen during training. miniCTX offers a challenging and realistic evaluation of neural theorem provers.

miniCTX: Neural Theorem Proving with (Long-)Contexts

TL;DR

miniCTX introduces a context-rich benchmark for neural theorem proving in Lean 4, challenging models with the need to use long, real-world context from multiple files and projects. The authors deploy ntp-toolkit to automatically extract contexts and provide a Lean REPL-based evaluation pipeline, enabling both next-tactic and full-proof generation tasks. Across baselines including GPT-4o prompting and file-tuning, results show that incorporating preceding file content and broader context substantially improves proof success, while cross-file premise retrieval yields mixed gains and can interfere with in-file context. The work highlights that contemporary benchmarks often miss context dependence and data contamination issues, and shows that robust context-enabled approaches are essential for practical, real-world theorem proving. Overall, miniCTX provides a rigorous, updateable framework for evaluating context-aware provers and motivates future approaches to long-context integration and cross-file reasoning in formalization projects.

Abstract

Real-world formal theorem proving often depends on a wealth of context, including definitions, lemmas, comments, file structure, and other information. We introduce miniCTX, which tests a model's ability to prove formal mathematical theorems that depend on new context that is not seen during training. miniCTX contains theorems sourced from real Lean projects and textbooks, each associated with a context that can span tens of thousands of tokens. Models are tasked with proving a theorem given access to code from the theorem's repository, which contains context that is needed for the proof. As a baseline for miniCTX, we tested fine-tuning and prompting methods that condition theorem proving on preceding context. Both approaches substantially outperform traditional methods that rely solely on state information. We found that this ability to use context is not captured by previous benchmarks such as miniF2F. Alongside miniCTX, we offer ntp-toolkit for automatically extracting and annotating theorem proving data, making it easy to add new projects into miniCTX to ensure that contexts are not seen during training. miniCTX offers a challenging and realistic evaluation of neural theorem provers.
Paper Structure (41 sections, 1 equation, 8 figures, 6 tables)

This paper contains 41 sections, 1 equation, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Many state of the art provers are trained on a static dataset of theorems and proofs, then evaluated on standalone problems such as competition problems (left). We argue that neural provers must also operate in the realistic context-dependent setting, in which results depend on working with new mathematical objects and their facts, notations, and the structural elements of the project (imports, variables, etc.) (right).
  • Figure 2: miniCTX is automatically updated with Lean projects to stay ahead of LLM training cutoff dates, making it a suitable benchmark for real-world theorem proving for pre-trained models.
  • Figure 3: State-tactic vs. file tuning.
  • Figure 4: Model performance by dependency on premises. For each theorem in miniCTX, we record as metadata whether its human-written proof depends on other definitions or theorems in the same file ("in-file") or in other files ("cross-file"), and test the performance of baselines on each type.
  • Figure 5: Percentage of different dependencies in the human-written proof of theorems in miniCTX.
  • ...and 3 more figures