Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Yegon Kim; Juho Lee

TL;DR

We formulate a decoupled prover-verifier game whose equilibria correspond to faithful and checkable translators, to accommodate the new objective of translation.

Abstract

As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve the checkability of model outputs, but they degrade accuracy compared to a baseline trained only to maximize correctness, a phenomenon called the legibility tax. We propose a solution that decouples the correctness condition from the checkability condition by training a "translator" model that turns a fixed solver model's solution into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to rewrite the solver's solution in a checkable form while retaining the solver's answer. To accommodate this new objective of translation, we formulate a decoupled prover-verifier game whose equilibria correspond to faithful and checkable translators.

Paper Structure (18 sections, 1 theorem, 13 equations, 2 figures, 1 table)

Introduction
Method
Background
Decoupled Prover-Verifier Game
Optimization of Decoupled Prover-Verifier Game
Experiments
Experimental Setup
Results
Training dynamics.
Test set accuracy.
Conclusion
Future directions.
Proof of Theorems
Further Details of the Experimental Setup
Reproduction of Prover-Verifier Games (Baseline)
...and 3 more sections

Key Result

Theorem 1

In the verifier-leading Stackelberg game where the verifier's utility is $R_V$ and the translator's utility is $R_T$, the tuple $(v^*, \tau^*, \tau'^*)$ being an equilibrium is necessary and sufficient for the faithfulness of $\tau^*$ with respect to $s$, and for the completeness and soundness prope
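For readers unfamiliar with the term, a leader-follower (Stackelberg) equilibrium can be sketched in generic form as below. The notation is an assumption made for illustration only: the follower's strategy is abbreviated to a single $\tau$ (the game in the theorem involves both $\tau$ and $\tau'$), and the argument lists of $R_V$ and $R_T$ are not taken from the paper. The follower best-responds to the leader's committed strategy, and the leader chooses its strategy anticipating that response:

$$\tau^*(v) \in \arg\max_{\tau} R_T(v, \tau), \qquad v^* \in \arg\max_{v} R_V\big(v, \tau^*(v)\big).$$

In a verifier-leading game, the verifier plays the role of the leader $v$ and the translator plays the role of the follower.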

Figures (2)

Figure 1: Rounds 1 and 2 of our decoupled prover-verifier game. All values are exponential moving averages with $\alpha=0.02$. Verifier score is the average logit output by the verifier, and faithfulness is the fraction of outputs from the faithful translator whose answer matches that of the solver.
Figure 2: Rounds 1 and 2 of the baseline prover-verifier game.
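
The two quantities named in the Figure 1 caption can be illustrated with a minimal Python sketch. Several details are assumptions rather than facts from the paper: that $\alpha$ weights the newest value, that the average is initialized with the first observation, and that answers are compared by exact string equality. The function names are hypothetical.

```python
from typing import Iterable, List, Sequence


def exponential_moving_average(values: Iterable[float], alpha: float = 0.02) -> List[float]:
    """Smooth a sequence with an exponential moving average.

    Assumption: alpha weights the newest value, and the average starts
    at the first observation (the caption does not specify either).
    """
    smoothed: List[float] = []
    ema = None
    for x in values:
        ema = x if ema is None else alpha * x + (1.0 - alpha) * ema
        smoothed.append(ema)
    return smoothed


def faithfulness(translator_answers: Sequence[str], solver_answers: Sequence[str]) -> float:
    """Fraction of translator outputs whose answer matches the solver's.

    Assumption: answers are already extracted and compared by exact equality.
    """
    if len(translator_answers) != len(solver_answers):
        raise ValueError("answer lists must have the same length")
    if not solver_answers:
        raise ValueError("answer lists must be non-empty")
    matches = sum(t == s for t, s in zip(translator_answers, solver_answers))
    return matches / len(solver_answers)
```

With $\alpha=0.02$ each new value contributes only 2% to the running average, so the plotted curves respond slowly to per-step noise.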

Theorems & Definitions (2)

Theorem 1
proof
