Combining Textual and Structural Information for Premise Selection in Lean

Job Petrovčič; David Eliecer Narvaez Denis; Ljupčo Todorovski

Combining Textual and Structural Information for Premise Selection in Lean

Job Petrovčič, David Eliecer Narvaez Denis, Ljupčo Todorovski

TL;DR

Premise selection in Lean proof environments is hindered by ignoring library structure. The authors blend ByT5-based textual embeddings with a Relational Graph Convolutional Network over a heterogeneous Lean dependency graph to produce graph-aware representations for proof states and premises. On the LeanDojo Mathlib benchmark, the graph-augmented approach improves Recall@1, Recall@10, and MRR over a ReProver baseline, with additional gains from ensemble EMA, though ablations suggest training choices may drive much of the improvement. The work highlights the potential of incorporating structural dependencies into premise retrieval for tactic-based Lean proofs and points to future work on graph attention and alternative data splits.

Abstract

Premise selection is a key bottleneck for scaling theorem proving in large formal libraries. Yet existing language-based methods often treat premises in isolation, ignoring the web of dependencies that connects them. We present a graph-augmented approach that combines dense text embeddings of Lean formalizations with graph neural networks over a heterogeneous dependency graph capturing both state-premise and premise-premise relations. On the LeanDojo Benchmark, our method outperforms the ReProver language-based baseline by over 25\% across standard retrieval metrics. These results suggest that relational information is beneficial for premise selection.

Combining Textual and Structural Information for Premise Selection in Lean

TL;DR

Abstract

Combining Textual and Structural Information for Premise Selection in Lean

TL;DR

Abstract

Paper Structure

Table of Contents

Figures (3)