Table of Contents
Fetching ...

miniCodeProps: a Minimal Benchmark for Proving Code Properties

Evan Lohn, Sean Welleck

Abstract

AI agents have shown initial promise in automating mathematical theorem proving in proof assistants such as Lean. The same proof assistants can be used to verify the correctness of code by pairing code with specifications and proofs that the specifications hold. Automating the writing of code, specifications, and proofs could lower the cost of verification, or, ambitiously, enable an AI agent to output safe, provably correct code. However, it remains unclear whether current neural theorem provers can automatically verify even relatively simple programs. We present miniCodeProps, a benchmark of 201 program specifications in the Lean proof assistant, aimed at the subproblem of automatically generating a proof for a provided program and specification. miniCodeProps contains specifications about simple, self-contained programs (e.g., lists, natural numbers, binary trees) with varied proof difficulty. Despite its simplicity, miniCodeProps is sufficient to break current LLM-based provers, with state-of-the-art methods showing promise on the easy properties in miniCodeProps, yet failing to prove nearly all of the medium and hard properties. We publicly release miniCodeProps as a benchmark for furthering automated theorem proving in the context of formally verified code.

miniCodeProps: a Minimal Benchmark for Proving Code Properties

Abstract

AI agents have shown initial promise in automating mathematical theorem proving in proof assistants such as Lean. The same proof assistants can be used to verify the correctness of code by pairing code with specifications and proofs that the specifications hold. Automating the writing of code, specifications, and proofs could lower the cost of verification, or, ambitiously, enable an AI agent to output safe, provably correct code. However, it remains unclear whether current neural theorem provers can automatically verify even relatively simple programs. We present miniCodeProps, a benchmark of 201 program specifications in the Lean proof assistant, aimed at the subproblem of automatically generating a proof for a provided program and specification. miniCodeProps contains specifications about simple, self-contained programs (e.g., lists, natural numbers, binary trees) with varied proof difficulty. Despite its simplicity, miniCodeProps is sufficient to break current LLM-based provers, with state-of-the-art methods showing promise on the easy properties in miniCodeProps, yet failing to prove nearly all of the medium and hard properties. We publicly release miniCodeProps as a benchmark for furthering automated theorem proving in the context of formally verified code.
Paper Structure (24 sections, 2 figures, 3 tables)

This paper contains 24 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Formal verification of code in an interactive proof assistant (here, Lean) consists of writing (i) code (e.g., an implementation of a Tree), (ii) a property that you want to verify (e.g. that the size of a balanced tree is odd), and (iii) a proof that the code satisfies the property. The underlying language verifies that the proof is correct, providing certainty that the property holds. miniCodeProps focuses on the important subproblem of proof generation: given a property and associated code, a model must generate a proof (e.g., shown here in green) that the property holds. miniCodeProps contains 201 properties about lightweight, self-contained code blocks, and measures a range of proving abilities.
  • Figure 2: miniCodeProps is sourced from Tons of Inductive Problems tip, a collection of programs and specifications written in Haskell. We translate the programs into Lean, and write and prove termination lemmas that are needed to state and prove properties of recursive functions.