Theorem Prover as a Judge for Synthetic Data Generation
Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen
TL;DR
The paper tackles the challenge of validating intermediate reasoning steps in LLM-based mathematical problem solving by introducing a verification-centric pipeline: iterative autoformalisation to increase Lean prover execution rates, TP-as-a-Judge to rigorously assess LLM reasoning via formal verification, and RLTPF to replace human annotation with prover feedback during RLHF. Across multiple LLMs, the approach achieves competitive gains on standard benchmarks using a compact synthetic dataset of $3{,}508$ samples, illustrating data-efficient improvements driven by high-quality intermediate reasoning. The work underscores the importance of formal verification in synthetic data generation and demonstrates robust, resource-conscious gains, while also highlighting limitations in domain applicability and computational cost. Overall, TP-as-a-Judge with RLTPF offers a principled, scalable direction for improving mathematical reasoning in LLMs through formalizable feedback loops.
Abstract
The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
