HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

Azim Ospanov, Zijin Feng, Jiacheng Sun, Haoli Bai, Xin Shen, Farzan Farnia

TL;DR

Hermes addresses the gap between informal LLM-based mathematical reasoning and formal, machine-checked proofs by interleaving stepwise informal reasoning with Lean4 formal verification. It introduces a four-module architecture (LLM, translator, prover, feedback) plus a memory block to preserve proof continuity, enabling iterative, verifiable reasoning with Lean4-backed signals. Across four challenging benchmarks and multiple base models, Hermes yields consistent accuracy improvements (average ~14% and up to 67% on AIME'25) and substantial efficiency gains (e.g., up to 80% reduction in inference FLOPs) compared with reward-based approaches. The accompanying open-source implementation demonstrates the practicality of scalable, interpretable, tool-augmented reasoning for formal mathematical tasks.
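To make "Lean4-backed signals" concrete, the snippet below shows what a single formally checked step could look like. It is an illustrative example only, not taken from the paper or its codebase: the informal claim, its formalization, and the choice of the `linarith` tactic are all assumptions, and it presumes a Lean 4 project with Mathlib available.

```lean
import Mathlib

-- Informal step (hypothetical): "since 2x + 3 = 7, we get x = 2".
-- If Lean accepts this proof, the step counts as verified;
-- if the check fails, the error can be returned to the LLM as feedback.
example (x : ℝ) (h : 2 * x + 3 = 7) : x = 2 := by
  linarith
```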

Abstract

Informal mathematics has been central to modern large language model (LLM) reasoning, offering flexibility and enabling efficient construction of arguments. However, purely informal reasoning is prone to logical gaps and subtle errors that are difficult to detect and correct. In contrast, formal theorem proving provides rigorous, verifiable mathematical reasoning, where each inference step is checked by a trusted compiler in systems such as Lean, but lacks the exploratory freedom of informal problem solving. This mismatch leaves current LLM-based math agents without a principled way to combine the strengths of both paradigms. In this work, we introduce Hermes, the first tool-assisted agent that explicitly interleaves informal reasoning with formally verified proof steps in Lean. The framework performs intermediate formal checking to prevent reasoning drift and employs a memory module that maintains proof continuity across long, multi-step reasoning chains, enabling both exploration and verification within a single workflow. We evaluate Hermes on four challenging mathematical reasoning benchmarks using LLMs of varying parameter scales, from small models to state-of-the-art systems. Across all settings, Hermes reliably improves the reasoning accuracy of base models while substantially reducing token usage and computational cost compared to reward-based approaches. On difficult datasets such as AIME'25, Hermes achieves up to a 67% accuracy improvement while using 80% fewer total inference FLOPs. The implementation and codebase are publicly available at https://github.com/aziksh-ospanov/HERMES.
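The interleaved workflow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_loop`, `Memory`, `StepResult`, and the toy stand-ins for the LLM, translator, and prover are all hypothetical names, and the real system calls an LLM and the Lean compiler where this sketch uses simple Python functions.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepResult:
    informal: str  # natural-language reasoning step from the LLM
    formal: str    # formalized version of the step (a Lean statement in the real system)


@dataclass
class Memory:
    """Hypothetical memory block: stores verified steps to keep the chain coherent."""
    verified_steps: List[StepResult] = field(default_factory=list)

    def context(self) -> str:
        return "\n".join(step.informal for step in self.verified_steps)


def run_loop(
    problem: str,
    propose_step: Callable[[str, str, str], Optional[str]],  # LLM module (assumed interface)
    translate: Callable[[str, str], str],                    # translator module (assumed interface)
    prove: Callable[[str], bool],                            # prover module (assumed interface)
    max_iterations: int = 8,
) -> Memory:
    memory = Memory()
    feedback = ""  # feedback module output passed to the next LLM call
    for _ in range(max_iterations):
        step = propose_step(problem, memory.context(), feedback)
        if step is None:  # the LLM signals that the solution is complete
            break
        formal = translate(step, memory.context())
        if prove(formal):
            memory.verified_steps.append(StepResult(step, formal))
            feedback = ""
        else:
            feedback = f"Verification failed for step: {step}"
    return memory


# Toy stand-ins so the sketch runs without an LLM or a Lean installation.
def toy_llm(problem: str, context: str, feedback: str) -> Optional[str]:
    if feedback:
        return "2 + 2 = 4"   # revise after a failed check
    if "2 + 2 = 4" in context:
        return None          # nothing left to prove
    return "2 + 2 = 5"       # first attempt contains an error


def toy_translate(step: str, context: str) -> str:
    return step              # identity "formalization" for the toy example


def toy_prove(formal: str) -> bool:
    lhs, rhs = formal.split("=")
    a, b = lhs.split("+")
    return int(a) + int(b) == int(rhs)


if __name__ == "__main__":
    result = run_loop("What is 2 + 2?", toy_llm, toy_translate, toy_prove)
    for verified in result.verified_steps:
        print(verified.informal)
```

In this toy run the first proposed step fails the check, the failure message is passed back as feedback, and only the corrected step is stored in memory. Keeping only verified steps in the context is one plausible reading of how a memory module could preserve proof continuity; the paper's actual design may differ.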

Paper Structure

This paper contains 27 sections, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Overview of the Hermes framework. Hermes is a Lean4-driven, multi-modular reasoning agent integrating LLM reasoning with formal verification for reliable mathematical problem solving. It comprises four modules: an LLM that generates reasoning steps, a translator that formalizes these steps into Lean code, a prover that symbolically verifies their correctness, and a feedback module that returns verification signals for subsequent reasoning. This design enables iterative reasoning with improved correctness and efficiency.
  • Figure 2: Full Hermes framework with illustrative examples.
  • Figure 3: Average reasoning token usage per problem on MATH500, AIME’25, CollegeMath, and HardMath2 under Zero-Shot Chain-of-Thought, Hermes, and Reward-based Best-of-5 settings.
  • Figure 4: Scaling behavior of Best-of-N (BoN) sampling across ORM, PRM, Safe, and Majority vote. The green dashed line indicates Hermes performance at @1. The token consumption for each BoN setting is shown below the plot as a one-dimensional heatmap.
  • Figure 5: Average TeraFLOPs per problem on MATH500, AIME’25, CollegeMath, and HardMath2 under Zero-Shot Chain-of-Thought, Hermes, and Reward-Based Best-of-5 settings.
  • ...and 3 more figures