Combining Textual and Structural Information for Premise Selection in Lean
Job Petrovčič, David Eliecer Narvaez Denis, Ljupčo Todorovski
TL;DR
Premise selection in Lean proof environments is hindered by ignoring library structure. The authors blend ByT5-based textual embeddings with a Relational Graph Convolutional Network over a heterogeneous Lean dependency graph to produce graph-aware representations for proof states and premises. On the LeanDojo Mathlib benchmark, the graph-augmented approach improves Recall@1, Recall@10, and MRR over a ReProver baseline, with additional gains from ensemble EMA, though ablations suggest training choices may drive much of the improvement. The work highlights the potential of incorporating structural dependencies into premise retrieval for tactic-based Lean proofs and points to future work on graph attention and alternative data splits.
Abstract
Premise selection is a key bottleneck for scaling theorem proving in large formal libraries. Yet existing language-based methods often treat premises in isolation, ignoring the web of dependencies that connects them. We present a graph-augmented approach that combines dense text embeddings of Lean formalizations with graph neural networks over a heterogeneous dependency graph capturing both state-premise and premise-premise relations. On the LeanDojo Benchmark, our method outperforms the ReProver language-based baseline by over 25\% across standard retrieval metrics. These results suggest that relational information is beneficial for premise selection.
