Library Learning Doesn't: The Curious Case of the Single-Use "Library"

Ian Berlot-Attwell, Frank Rudzicz, Xujie Si

TL;DR

In LEGO-Prover and TroVE, function reuse is extremely infrequent on miniF2F and MATH. Follow-up ablation experiments suggest that self-correction and self-consistency, rather than reuse, are the primary drivers of the observed performance gains.

Abstract

Advances in Large Language Models (LLMs) have spurred a wave of LLM library learning systems for mathematical reasoning. These systems aim to learn a reusable library of tools, such as formal Isabelle lemmas or Python programs that are tailored to a family of tasks. Many of these systems are inspired by the human structuring of knowledge into reusable and extendable concepts, but do current methods actually learn reusable libraries of tools? We study two library learning systems for mathematics which both reported increased accuracy: LEGO-Prover and TroVE. We find that function reuse is extremely infrequent on miniF2F and MATH. Our follow-up ablation experiments suggest that, rather than reuse, self-correction and self-consistency are the primary drivers of the observed performance gains. Our code and data are available at https://github.com/ikb-a/curious-case

Paper Structure

This paper contains 15 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: LEGO-Prover performance on a subset of the miniF2F validation split. The ablated model cannot reuse lemmas and performs similarly. The shaded region is one standard deviation, capturing variations in LLM output and race conditions.
  • Figure 2: Example of verbatim reuse by the LEGO-Prover. The input lemma is reproduced exactly in the Prover's output.
  • Figure 3: Example of name reuse by the LEGO-Prover. Only the name of the input lemma needs to be reproduced exactly in the output. In this case, the body of the input lemma has been significantly adjusted. Note that Figure 2 is also an example of name reuse, as the input lemma's name appears in the solution (in that particular case, along with the rest of the lemma).
  • Figure 4: LEGO-Prover input lemmas (left) and found proof (right). The proof proves that $\forall k \in \mathbb{R}:$ if $x = (13 - \sqrt{131}) / 4$ and $2x^2 - 13x + k = 0$ then $k = 19/4$. See Figure 5 for a typeset approximation, and commentary on LEGO-Prover's use of (and failure to use) the input lemmas.
  • Figure 5: A typeset approximation of LEGO-Prover input lemmas (left) and found proof (right). The proof proves that $\forall k \in \mathbb{R}:$ if $x = (13 - \sqrt{131}) / 4$ and $2x^2 - 13x + k = 0$ then $k = 19/4$. See Figure 4 for the original Isabelle lemmas and proof. Note that skill 1 may have been indirectly used in rewriting $2x^2 - 13x + k = 0$ as $k = 13x - 2x^2$. Skills 2, 3 and 4 do not seem to be used directly or indirectly; furthermore, skills 2 and 4 are the same lemma. Their use of smt and metis may have encouraged the prover to use these same tools.
  • ...and 2 more figures
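
The captions for Figures 2 and 3 distinguish verbatim reuse (the whole input lemma reappears in the output) from name reuse (only the lemma's name reappears). The following Python sketch illustrates one way such a check could be written. It is not the authors' implementation: the function names, the whitespace normalization, and the regular expression for extracting a lemma name are all assumptions made for illustration. Because every verbatim copy also contains the name, the sketch reports the stronger category first.

```python
import re

# Assumption: lemma names follow `lemma <name>:` or `theorem <name>:`.
# This pattern is illustrative and not taken from the paper's code.
_LEMMA_NAME = re.compile(r"\b(?:lemma|theorem)\s+([A-Za-z_][A-Za-z0-9_']*)")


def normalize(text: str) -> str:
    """Collapse whitespace so layout differences do not hide exact copies."""
    return " ".join(text.split())


def classify_reuse(input_lemma: str, solution: str) -> str:
    """Return 'verbatim', 'name', or 'none' for one (lemma, solution) pair."""
    # Verbatim reuse: the full lemma text appears in the solution.
    if normalize(input_lemma) in normalize(solution):
        return "verbatim"
    # Name reuse: only the lemma's identifier appears as a whole token.
    match = _LEMMA_NAME.search(input_lemma)
    if match:
        name = re.escape(match.group(1))
        token = rf"(?<![A-Za-z0-9_']){name}(?![A-Za-z0-9_'])"
        if re.search(token, solution):
            return "name"
    return "none"


# Hypothetical example lemma and solutions, written only for this sketch.
lemma = 'lemma sq_ge0: "(0::real) <= x * x"\n  by simp'
copied = lemma + '\ntheorem goal: "P" using sq_ge0 by auto'
renamed_body = 'lemma sq_ge0: "(0::real) <= y * y" by auto\ntheorem goal: "P" using sq_ge0 by auto'
unrelated = 'theorem goal: "P" by (simp add: sq_ge0_alt)'

print(classify_reuse(lemma, copied))
print(classify_reuse(lemma, renamed_body))
print(classify_reuse(lemma, unrelated))
```

A string-level check like this only detects that a lemma or its name appears in the output. It says nothing about whether the lemma contributed to the proof, which is why Figure 5 adds manual commentary on which skills were used directly or indirectly.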