Models That Prove Their Own Correctness

Noga Amit; Shafi Goldwasser; Orr Paradise; Guy Rothblum

Models That Prove Their Own Correctness

Noga Amit, Shafi Goldwasser, Orr Paradise, Guy Rothblum

TL;DR

This work defines Self-Proving models that output an answer and simultaneously prove its correctness to a fixed verifier via an interactive proof, enabling input-specific trust guarantees. It develops two learning paradigms, Transcript Learning (TL) and Reinforcement Learning from Verifier Feedback (RLVF), with gradient formulations and (in TL) convergence guarantees under standard assumptions. Empirical validation on a GCD task shows substantial verifiability gains: TL yields moderate improvements, RLVF boosts performance further, and annotated transcripts push verifiability up to near-perfect levels. The framework integrates interactive proof theory with learnable models to enable verifiable, per-input correctness in domains where formal proofs are feasible, signaling a path toward trustworthy AI systems with formal guarantees.

Abstract

How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured on average over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to train Self-Proving models that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof. Self-Proving models satisfy that, with high probability over an input sampled from a given distribution, the model generates a correct output and successfully proves its correctness to $V$. The soundness property of $V$ guarantees that, for every input, no model can convince $V$ of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while all incorrect outputs (of any model) are detected by $V$. We devise and analyze two generic methods for learning Self-Proving models: Transcript Learning (TL) which relies on access to transcripts of accepting interactions, and Reinforcement Learning from Verifier Feedback (RLVF) which trains a model by emulating interactions with the verifier.

Models That Prove Their Own Correctness

TL;DR

Abstract

via an Interactive Proof. Self-Proving models satisfy that, with high probability over an input sampled from a given distribution, the model generates a correct output and successfully proves its correctness to

. The soundness property of

guarantees that, for every input, no model can convince

of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while all incorrect outputs (of any model) are detected by

. We devise and analyze two generic methods for learning Self-Proving models: Transcript Learning (TL) which relies on access to transcripts of accepting interactions, and Reinforcement Learning from Verifier Feedback (RLVF) which trains a model by emulating interactions with the verifier.

Paper Structure (45 sections, 4 theorems, 41 equations, 6 figures, 2 tables, 4 algorithms)

This paper contains 45 sections, 4 theorems, 41 equations, 6 figures, 2 tables, 4 algorithms.

Introduction
Related Work
ML and IPs.
Models that generate formal proofs.
Training on intermediate steps.
Subsequent work on RLVF.
Defining Self-Proving Models
Preliminaries
Self-Proving Models
A Definition for General Loss Functions and One-To-Many Relations
General (bounded) loss functions.
One-to-many-relations.
The General Definition
Algorithms for Learning Self-Proving Models
Transcript Learning
...and 30 more sections

Key Result

Lemma 4.2

Fix an input distribution $\mu$ over $\Sigma^\ast$ and a verifier $V$ with round complexity $R$ and answer length $L_a$. Fix an honest transcript generator $\mathcal{T}_V^\ast$. Let $\theta$ denote the parameters of the model $P_\theta$, let $A(\theta)$ be as defined above and let the terms $S(r)$,

Figures (6)

Figure 1: Self-Proving models. For input $x$, Self-Proving model $P_\theta$ generates an output $y$ and sends it to a Verification Algorithm $V$. Then, over $i \in [R]$ rounds, $V$ sends query $q_i$, and receives an answer $a_i$ from $P_\theta$. Finally, $V$ decides ("accept/reject") whether it is convinced that $y$ is a correct output for $x$.
Figure 2: Transcript Learning, visualized. To understand \ref{['alg:TL']}, consider the above visualization. In Phase 1, $N$ honest transcripts are collected by interacting an honest prover with the Verifier; these serve as samples from the transcript generator $\mathcal{T}^\ast_V(x)$. Phase 2 runs \ref{['alg:TL']}: for each transcript $\pi^\ast$ (lines 2–3) and each prefix $\pi_s$ (lines 4–6), the values $\alpha_s(\theta_i)$ and $\vec{d}_s(\theta_i)$ are computed via forward and backward passes (line 7), followed by a parameter update (line 8).
Figure 3: Verifiability with increasing amounts of annotation. $T$ is the number of steps added in Annotated Transcript Learning. Dashed lines indicate Euclidean depth, that bound the Verifiability of models that prove only for integers up to a certain number of steps. Each $T$ was run with three seeds, with mean $\pm$ standard error depicted. The upper graph provides a zoomed-in view of the 82% to 98% range from the lower graph, which spans a broader scale from 20% to 100%.
Figure 4: Annotated TL Verifiability as a function of the number of samples $N$. Each iteration (X axis) is a batch of 1024 samples from a dataset of $\approx$10M sequences. Every 10k iterations, Verifiability was evaluated on a held-out dataset of 1k inputs. $T$ is the number of steps in Annotated Transcript Learning (\ref{['fig:annot']}), and $T=0$ is non-annotated Transcript Learning. Each $T$ was run with three seeds, with mean depicted by the curve and standard error by the shaded area.
Figure 5: The number of prime divisors of a base $\omega(B)$ determines Verifiability. For each $o \in [4]$, we sampled 17 bases $B \in \{2, \dots, 1386 \}$ such that $\omega(B)=o$. A Self-Proving transformer was trained via Transcript Learning for twenty epochs on an identical dataset of 1024K samples encoded in base $B$. For each $\omega(B)$ we depict the mean $\pm$ standard error.
...and 1 more figures

Theorems & Definitions (18)

Definition 3.1: Correctness
Definition 3.2
Remark 3.3: Verifier efficiency
Definition 3.4: Self-Proving model
Remark 3.5: Verifiability $\implies$ correctness
Definition 3.6: Approximate correctness
Definition 3.7: \ref{['def:sys-ver']}, generalized
Remark 3.8: Connection to Interactive Proofs of Proximity
Definition 3.9: \ref{['def:ver_mod']}, generalized
Definition 4.1: Transcript generator
...and 8 more

Models That Prove Their Own Correctness

TL;DR

Abstract

Models That Prove Their Own Correctness

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (6)

Theorems & Definitions (18)