Table of Contents
Fetching ...

LLM Library Learning Fails: A LEGO-Prover Case Study

Ian Berlot-Attwell, Frank Rudzicz, Xujie Si

TL;DR

A deep dive into one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning is performed, finding that LEGO-Prover does not in fact improve over the simple baseline of prompting the model - the improvements in task accuracy vanish once computational cost is accounted for.

Abstract

Recent advancements in the coding, reasoning, and tool-using abilities of LLMs have spurred interest in library learning (i.e., online learning through the creation, storage, and retrieval of reusable and composable functions, knowledge, checklists, or lemmas). Such systems often promise improved task performance through the automatic creation of broadly applicable tools, as well as superior computational performance through the caching of reasoning (i.e., the storage of generated tools). However, we find strong reason to be skeptical. We perform a deep dive into one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples). Crucially, we find that LEGO-Prover does not in fact improve over the simple baseline of prompting the model - the improvements in task accuracy vanish once computational cost is accounted for. Our findings suggest that serious misconceptions exist as to the effectiveness of these techniques, that a serious re-examination of the state of LLM-based library learning is required, and that we require much stronger standards for evaluation including behavioural analysis and ensuring that an equal computational budget is used for baselines.

LLM Library Learning Fails: A LEGO-Prover Case Study

TL;DR

A deep dive into one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning is performed, finding that LEGO-Prover does not in fact improve over the simple baseline of prompting the model - the improvements in task accuracy vanish once computational cost is accounted for.

Abstract

Recent advancements in the coding, reasoning, and tool-using abilities of LLMs have spurred interest in library learning (i.e., online learning through the creation, storage, and retrieval of reusable and composable functions, knowledge, checklists, or lemmas). Such systems often promise improved task performance through the automatic creation of broadly applicable tools, as well as superior computational performance through the caching of reasoning (i.e., the storage of generated tools). However, we find strong reason to be skeptical. We perform a deep dive into one such system, LEGO-Prover, which purports to learn reusable lemmas for mathematical reasoning. We find no evidence of the direct reuse of learned lemmas, and find evidence against the soft reuse of learned lemmas (i.e., reuse by modifying relevant examples). Crucially, we find that LEGO-Prover does not in fact improve over the simple baseline of prompting the model - the improvements in task accuracy vanish once computational cost is accounted for. Our findings suggest that serious misconceptions exist as to the effectiveness of these techniques, that a serious re-examination of the state of LLM-based library learning is required, and that we require much stronger standards for evaluation including behavioural analysis and ensuring that an equal computational budget is used for baselines.

Paper Structure

This paper contains 25 sections, 5 figures, 8 tables.

Figures (5)

  • Figure 1: These plots demonstrate that LEGO-Prover exhibits soft use, but not reuse, of lemmas. The top row concerns use. It plots the percentage of lemmas meeting a given the threshold for soft use in at least one solution (see Section \ref{['sec:softusescore']} for the definition of this score). The lemmas available to the Prover are shown in the solid blue line, these are the only lemmas for which use is possible. As expected, in the top row this line remains the highest for large values of the threshold, surpassing the non-retrieved lemmas (red dashed line). However, in the bottom row we consider the proportion of lemmas meeting the soft use threshold in two tasks (i.e., reused by LEGO-Prover). Unlike in the top row, there is no significant difference between the lines -- suggesting retrieved lemmas are only meaningful in a single task (consequently, used, not reused). See Section \ref{['sec:use']} for a detailed explanation.
  • Figure 2: We define a normalized score for the degree to which a lemma is used in a proof (i.e., soft use). Using this score and a threshold, we can determine whether LEGO-Prover's solution to a task soft-uses the retrieved lemmas provided in the Prover's context. In blue we plot the percentage of solutions meeting the desired threshold for soft use. In orange, we plot the percentage of solutions meeting the threshold in a lemma that is also used at the same level in at least one other solution. We observe that the lines for reuse rapidly reach zero for even moderate similarity thresholds wheras the lines for use remain relatively high. This suggests that lemmas are used by the Prover in generating solutions, but that there tends to be only one task that exploits a given lemma.
  • Figure 3: Average LEGO-Prover and Draft-Sketch-Prove performance compared by average cost in USD. The shaded region is 1 standard deviation based on 3 trials. Draft-Sketch-Prove has strongly superior performance using 4o-mini, and comparable performance on gpt-4o and o3-mini. The vertical line represents the number of Draft-Sketch-Prove prover attempts equal to LEGO-Prover's. For reasons of cost o3-mini was run on a random 10% subset of the test set and gpt-4o was run on the same random 10% subset as well as a 20% subset. Note that as the cost of o3-mini was over twice our predicted cost we were unable to run Draft-Sketch-Prove for a comparable budget. See Appendix \ref{['app:cost_plot_creation']} for the details of how this plot was generated.
  • Figure 4: For LEGO-Prover and Draft-Sketch-Prove we compare 5 runs with our default prompt versus 5 runs with different paraphrased prompts. We evaluate on a subset of the validation set. We find that the systems are fairly robust to minor changes in wording -- performance is typically within 1 standard deviation. These results are for a 5% subset of the validation split.
  • Figure 5: For LEGO-Prover and Draft-Sketch-Prove we compare 5 runs with our default prompt versus 5 runs with different paraphrased prompts. We evaluate on a subset of the test set. We find that these system are fairly robust to minor changes in wording -- performance is typically within 1 standard deviation. These results are for a 5% subset of the test split.