RocqSmith: Can Automatic Optimization Forge Better Proof Agents?
Andrei Kozyrev, Nikita Khramov, Denis Lochmelis, Valerio Morelli, Gleb Solovev, Anton Podkopaev
TL;DR
The paper investigates whether automatic agent optimization can improve Rocq proof agents in interactive theorem proving, aiming to reduce manual engineering. It evaluates a suite of optimization methods from DSPy and related work on a simplified RocqStar baseline using the IMM benchmark and GPT-4.1, comparing prompt-centric, context-based, and topology-based approaches. Key findings show that simple few-shot bootstrapping offers the most consistent gains, particularly for ReAct-style agents, while more sophisticated optimizers yield limited or unstable improvements and none reach a hand-engineered state-of-the-art agent. The work demonstrates that automatic optimization can partially alleviate manual tuning in formal verification pipelines but falls short of fully replacing expert-crafted designs, highlighting the need for refined context strategies and robust prompt engineering in future efforts.
Abstract
This work studies the applicability of automatic AI agent optimization methods to real-world agents in formal verification settings, focusing on automated theorem proving in Rocq as a representative and challenging domain. We evaluate how different automatic agent optimizers perform when applied to the task of optimizing a Rocq proof-generation agent, and assess whether parts of the fine-grained tuning of agentic systems, such as prompt design, contextual knowledge, and control strategies, can be automated. Our results show that while several optimizers yield measurable improvements, simple few-shot bootstrapping is the most consistently effective; however, none of the studied methods matches the performance of a carefully engineered state-of-the-art proof agent.
