Table of Contents
Fetching ...

Proof Flow: Preliminary Study on Generative Flow Network Language Model Tuning for Formal Reasoning

Matthew Ho, Vincent Zhu, Xiaoyin Chen, Moksh Jain, Nikolay Malkin, Edwin Zhang

TL;DR

This paper presents a proof of concept in the domain of formal reasoning, specifically in the Neural Theorem Proving (NTP) setting, where proofs specified in a formal language such as Lean can be deterministically and objectively verified.

Abstract

Reasoning is a fundamental substrate for solving novel and complex problems. Deliberate efforts in learning and developing frameworks around System 2 reasoning have made great strides, yet problems of sufficient complexity remain largely out of reach for open models. To address this gap, we examine the potential of Generative Flow Networks as a fine-tuning method for LLMs to unlock advanced reasoning capabilities. In this paper, we present a proof of concept in the domain of formal reasoning, specifically in the Neural Theorem Proving (NTP) setting, where proofs specified in a formal language such as Lean can be deterministically and objectively verified. Unlike classical reward-maximization reinforcement learning, which frequently over-exploits high-reward actions and fails to effectively explore the state space, GFlowNets have emerged as a promising approach for sampling compositional objects, improving generalization, and enabling models to maintain diverse hypotheses. Our early results demonstrate GFlowNet fine-tuning's potential for enhancing model performance in a search setting, which is especially relevant given the paradigm shift towards inference time compute scaling and "thinking slowly."

Proof Flow: Preliminary Study on Generative Flow Network Language Model Tuning for Formal Reasoning

TL;DR

This paper presents a proof of concept in the domain of formal reasoning, specifically in the Neural Theorem Proving (NTP) setting, where proofs specified in a formal language such as Lean can be deterministically and objectively verified.

Abstract

Reasoning is a fundamental substrate for solving novel and complex problems. Deliberate efforts in learning and developing frameworks around System 2 reasoning have made great strides, yet problems of sufficient complexity remain largely out of reach for open models. To address this gap, we examine the potential of Generative Flow Networks as a fine-tuning method for LLMs to unlock advanced reasoning capabilities. In this paper, we present a proof of concept in the domain of formal reasoning, specifically in the Neural Theorem Proving (NTP) setting, where proofs specified in a formal language such as Lean can be deterministically and objectively verified. Unlike classical reward-maximization reinforcement learning, which frequently over-exploits high-reward actions and fails to effectively explore the state space, GFlowNets have emerged as a promising approach for sampling compositional objects, improving generalization, and enabling models to maintain diverse hypotheses. Our early results demonstrate GFlowNet fine-tuning's potential for enhancing model performance in a search setting, which is especially relevant given the paradigm shift towards inference time compute scaling and "thinking slowly."

Paper Structure

This paper contains 17 sections, 8 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: Proof Flow System. We extract and filter ground-truth proofs and theorems from LEAN 4's standard math library: mathlib. For reward model training data, we sample candidate tactics using a base model and label with Lean. We use Best First Search pearl1984heuristics for evaluation. The GFlowNet diagram is a frame of an animation from bengio_GFlowNet_2022
  • Figure 2: Evaluation on hold-out set of 20 theorems unseen during train time. Validation run every 20 gradient steps. GFlowNet-OO refers to Online Only, GFlownet-BR-OO refers to Binary Reward and Online Only, while GFlowNet refers to the full method. SFT refers to Supervised Fine-Tuning, or just maximizing log likelihood on the ground truth trajectory. For exact theorem names and lengths see \ref{['tab:accuracy_results']}.
  • Figure 3: Training Trajectory Balance loss over 2000 gradient steps. Loss is smoothed using simple moving average over 100 steps. Interestingly, GFlowNet observes instability in training around episode 1000. Note that the full GFlowNet is the only off-policy method, as opposed to the on-policy GFlowNet-OO and GFlowNet-BR-OO.
  • Figure 4: SFT Baseline Train Loss over Time
  • Figure 5: SFT Baseline Validation Loss over Time