Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

Maria-Florina Balcan; Avrim Blum; Kiriaki Fragkia; Zhiyuan Li; Dravyansh Sharma

Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

Maria-Florina Balcan, Avrim Blum, Kiriaki Fragkia, Zhiyuan Li, Dravyansh Sharma

TL;DR

An online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution is proposed and novel extensions of the Littlestone dimension are introduced which tightly characterize the mistake bounds for learning a verifier in the realizable setting.

Abstract

Large language models with chain-of-thought generation have demonstrated great potential for producing complex mathematical proofs. However, their reasoning can often go astray, leading to increasing interest in formal and learned verifiers. A major challenge in learning verifiers, especially when their output will be used by the prover, is that this feedback loop may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric role of soundness (failure in catching errors in a proof) and completeness (flagging correct proofs as wrong) mistakes of the verifier, we introduce novel extensions of the Littlestone dimension which tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto-frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak provers, and enable generation of proofs beyond what they were trained on. With the mild assumption that one of the provers can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong prover with small error and abstention rates.

Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

TL;DR

Abstract

Paper Structure (29 sections, 18 theorems, 24 equations, 1 figure, 9 algorithms)

This paper contains 29 sections, 18 theorems, 24 equations, 1 figure, 9 algorithms.

Introduction
Our Contributions
Related Work
Setting & Background
Problem statements, reasoning traces, and verifiers.
Online chain-of-thought verification.
Background: Littlestone Dimension
Mistake tree.
Online Chain-of-Thought Verification
Warm-up: Sound vs. Non-sound Online Verification
Reduction to Verifying Prefixes
Prefix Verification.
Prefix Verification with a Soundness Mistake Budget
Soundness-Completeness (SC)-mistake tree.
Online Verification with Linear Cost Objective
...and 14 more sections

Key Result

Proposition 3.0

For any finite class of verifiers, $H$, there exists an online verification algorithm that makes at most $O(\log |H|)$ mistakes. Moreover, there exists a class of verifiers such that any online learning algorithm must make at least $\Omega(\log |H|)$ mistakes.

Figures (1)

Figure 1: An SC-mistake tree for the class $H=\left\{\,h_i:\{0,1\}^4\to\{0,1\}\ \middle|\ i\in[4],\ h_i(x)=x_i\wedge \bigwedge_{j\in[4]\setminus\{i\}}\neg x_j \right\}$. We interpret output $0$ as $\textsc{no}$ and $1$ as $\textsc{yes}$.

Theorems & Definitions (44)

Definition 2.1: Littlestone dimension
Proposition 3.0
proof : Proof Sketch
Proposition 3.0
proof : Proof Sketch
Example 3.1: The River-Crossing Puzzle
Theorem 3.2
proof
Theorem 3.3
proof
...and 34 more

Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

TL;DR

Abstract

Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

Authors

TL;DR

Abstract

Table of Contents

Key Result

Figures (1)

Theorems & Definitions (44)