Steering LLMs for Formal Theorem Proving
Shashank Kirtania, Arun Iyer
TL;DR
The paper investigates how informal natural-language reasoning influences formal theorem proving by LLMs and introduces activation steering, an inference-time intervention that injects a linear direction from NL context into the residual stream to guide proof construction without fine-tuning. Steering vectors are constructed via a difference-of-means on activations computed from paired NL-augmented and formal prompts, and are applied at carefully chosen layers to bias the model toward NL-informed reasoning. The authors demonstrate consistent improvements on MiniF2F and PutnamBench across multiple models and decoding strategies, reveal that late layers are most impactful, and show potential for cross-prover transfer without training. This work advances mechanistic interpretability of LLMs, provides a lightweight method to enhance formal proving, and suggests broader implications for transferable reasoning across proof systems.
Abstract
Recent advances in automated theorem proving use Large Language Models (LLMs) to translate informal mathematical statements into formal proofs. However, informal cues are often ambiguous or lack strict logical structure, making it hard for models to interpret them precisely. While existing methods achieve strong performance, little is known about how LLMs internally represent informal cues, or how these influence proof generation. To address this, we explore \textit{activation steering}, an inference-time intervention that identifies linear directions in residual activations associated with informal reasoning traces and adjusts them to improve proof construction without fine-tuning. This mechanism also yields interpretable information about how reasoning is internally encoded in the activation space of LLMs. We test our method for generating formal proofs from already-formalized theorems. Our contributions are twofold: (1) a novel activation-based intervention for guiding proof synthesis in LLMs; and (2) demonstration that this intervention improves performance under two decoding strategies (sampling and best-first search) without any further training.
