Table of Contents
Fetching ...

Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

Maria-Florina Balcan, Avrim Blum, Kiriaki Fragkia, Zhiyuan Li, Dravyansh Sharma

TL;DR

An online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution is proposed and novel extensions of the Littlestone dimension are introduced which tightly characterize the mistake bounds for learning a verifier in the realizable setting.

Abstract

Large language models with chain-of-thought generation have demonstrated great potential for producing complex mathematical proofs. However, their reasoning can often go astray, leading to increasing interest in formal and learned verifiers. A major challenge in learning verifiers, especially when their output will be used by the prover, is that this feedback loop may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric role of soundness (failure in catching errors in a proof) and completeness (flagging correct proofs as wrong) mistakes of the verifier, we introduce novel extensions of the Littlestone dimension which tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto-frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak provers, and enable generation of proofs beyond what they were trained on. With the mild assumption that one of the provers can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong prover with small error and abstention rates.

Online Learnability of Chain-of-Thought Verifiers: Soundness and Completeness Trade-offs

TL;DR

An online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution is proposed and novel extensions of the Littlestone dimension are introduced which tightly characterize the mistake bounds for learning a verifier in the realizable setting.

Abstract

Large language models with chain-of-thought generation have demonstrated great potential for producing complex mathematical proofs. However, their reasoning can often go astray, leading to increasing interest in formal and learned verifiers. A major challenge in learning verifiers, especially when their output will be used by the prover, is that this feedback loop may produce substantial distribution shift. Motivated by this challenge, we propose an online learning framework for learning chain-of-thought verifiers that, given a problem and a sequence of reasoning steps, check the correctness of the solution. Highlighting the asymmetric role of soundness (failure in catching errors in a proof) and completeness (flagging correct proofs as wrong) mistakes of the verifier, we introduce novel extensions of the Littlestone dimension which tightly characterize the mistake bounds for learning a verifier in the realizable setting. We provide optimal algorithms for finding the Pareto-frontier (the smallest total number of mistakes given a budget of soundness mistakes) as well as minimizing a linear combination of asymmetric costs. We further show how our learned verifiers can be used to boost the accuracy of a collection of weak provers, and enable generation of proofs beyond what they were trained on. With the mild assumption that one of the provers can generate the next reasoning step correctly with some minimal probability, we show how to learn a strong prover with small error and abstention rates.
Paper Structure (29 sections, 18 theorems, 24 equations, 1 figure, 9 algorithms)

This paper contains 29 sections, 18 theorems, 24 equations, 1 figure, 9 algorithms.

Key Result

Proposition 3.0

For any finite class of verifiers, $H$, there exists an online verification algorithm that makes at most $O(\log |H|)$ mistakes. Moreover, there exists a class of verifiers such that any online learning algorithm must make at least $\Omega(\log |H|)$ mistakes.

Figures (1)

  • Figure 1: An SC-mistake tree for the class $H=\left\{\,h_i:\{0,1\}^4\to\{0,1\}\ \middle|\ i\in[4],\ h_i(x)=x_i\wedge \bigwedge_{j\in[4]\setminus\{i\}}\neg x_j \right\}$. We interpret output $0$ as $\textsc{no}$ and $1$ as $\textsc{yes}$.

Theorems & Definitions (44)

  • Definition 2.1: Littlestone dimension
  • Proposition 3.0
  • proof : Proof Sketch
  • Proposition 3.0
  • proof : Proof Sketch
  • Example 3.1: The River-Crossing Puzzle
  • Theorem 3.2
  • proof
  • Theorem 3.3
  • proof
  • ...and 34 more