Conjecturing: An Overlooked Step in Formal Mathematical Reasoning
Jasivan Alex Sivakumar, Philipp Borchert, Ronald Cardenas, Gerasimos Lampouras
TL;DR
This paper reframes autoformalisation as two intertwined tasks by treating conjecturing as a separate prerequisite in formal mathematical reasoning. It introduces ConjectureBench, a dataset and evaluation framework with ConJudge and equiv_rfl metrics to separately assess conjecturing, and Lean-FIRe, an inference-time method that interleaves informal Chain-of-Thought with Lean-of-Thought reasoning. Empirical results show that withholding conjectures substantially degrades autoformalisation performance, even for strong models like GPT-4.1, while Lean-FIRe yields meaningful gains and achieves first end-to-end autoformalisation on PutnamBench 'no-answer' problems for some models, underscoring that latent mathematical knowledge exists but requires targeted guidance to conjecture and formalise effectively. The work suggests a clear research agenda: develop richer conjecturing data, design inference-time strategies, and train models to separate and reintegrate conjecturing within the autoformalisation pipeline.
Abstract
Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. This, however, disregards a critical preceding step: conjecturing. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion such as an explicit answer, or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled within autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs both as a distinct task and within the autoformalisation pipeline. Our evaluation of foundational models, including GPT-4.1 and DeepSeek-V3.1, reveals that their autoformalisation performance is substantially overestimated when the conjecture is accounted for during evaluation. However, the conjecture should not be assumed to be provided. We design an inference-time method, Lean-FIRe to improve conjecturing and autoformalisation, which, to the best of our knowledge, achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformalisation performance requires treating conjecturing as an independent task, and investigating further how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning.
