Table of Contents
Fetching ...

Steering LLMs for Formal Theorem Proving

Shashank Kirtania, Arun Iyer

TL;DR

The paper investigates how informal natural-language reasoning influences formal theorem proving by LLMs and introduces activation steering, an inference-time intervention that injects a linear direction from NL context into the residual stream to guide proof construction without fine-tuning. Steering vectors are constructed via a difference-of-means on activations computed from paired NL-augmented and formal prompts, and are applied at carefully chosen layers to bias the model toward NL-informed reasoning. The authors demonstrate consistent improvements on MiniF2F and PutnamBench across multiple models and decoding strategies, reveal that late layers are most impactful, and show potential for cross-prover transfer without training. This work advances mechanistic interpretability of LLMs, provides a lightweight method to enhance formal proving, and suggests broader implications for transferable reasoning across proof systems.

Abstract

Recent advances in automated theorem proving use Large Language Models (LLMs) to translate informal mathematical statements into formal proofs. However, informal cues are often ambiguous or lack strict logical structure, making it hard for models to interpret them precisely. While existing methods achieve strong performance, little is known about how LLMs internally represent informal cues, or how these influence proof generation. To address this, we explore \textit{activation steering}, an inference-time intervention that identifies linear directions in residual activations associated with informal reasoning traces and adjusts them to improve proof construction without fine-tuning. This mechanism also yields interpretable information about how reasoning is internally encoded in the activation space of LLMs. We test our method for generating formal proofs from already-formalized theorems. Our contributions are twofold: (1) a novel activation-based intervention for guiding proof synthesis in LLMs; and (2) demonstration that this intervention improves performance under two decoding strategies (sampling and best-first search) without any further training.

Steering LLMs for Formal Theorem Proving

TL;DR

The paper investigates how informal natural-language reasoning influences formal theorem proving by LLMs and introduces activation steering, an inference-time intervention that injects a linear direction from NL context into the residual stream to guide proof construction without fine-tuning. Steering vectors are constructed via a difference-of-means on activations computed from paired NL-augmented and formal prompts, and are applied at carefully chosen layers to bias the model toward NL-informed reasoning. The authors demonstrate consistent improvements on MiniF2F and PutnamBench across multiple models and decoding strategies, reveal that late layers are most impactful, and show potential for cross-prover transfer without training. This work advances mechanistic interpretability of LLMs, provides a lightweight method to enhance formal proving, and suggests broader implications for transferable reasoning across proof systems.

Abstract

Recent advances in automated theorem proving use Large Language Models (LLMs) to translate informal mathematical statements into formal proofs. However, informal cues are often ambiguous or lack strict logical structure, making it hard for models to interpret them precisely. While existing methods achieve strong performance, little is known about how LLMs internally represent informal cues, or how these influence proof generation. To address this, we explore \textit{activation steering}, an inference-time intervention that identifies linear directions in residual activations associated with informal reasoning traces and adjusts them to improve proof construction without fine-tuning. This mechanism also yields interpretable information about how reasoning is internally encoded in the activation space of LLMs. We test our method for generating formal proofs from already-formalized theorems. Our contributions are twofold: (1) a novel activation-based intervention for guiding proof synthesis in LLMs; and (2) demonstration that this intervention improves performance under two decoding strategies (sampling and best-first search) without any further training.

Paper Structure

This paper contains 26 sections, 3 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Steering Vectors are computed as difference of means for $p^{+}$ and $p$
  • Figure 2: Cosine similarity across model's residual stream activations for each layer.
  • Figure 3: Pass rates at $2\times32\times600$ for InternLM2.5-StepProver on miniF2F.
  • Figure 4: Layer-wise ablation results.
  • Figure 5: Pass rates on miniF2F with varying search budgets.
  • ...and 2 more figures