Mitigating Legibility Tax with Decoupled Prover-Verifier Games

Yegon Kim, Juho Lee

TL;DR

We formulate a decoupled prover-verifier game whose equilibria correspond to faithful and checkable translators, accommodating the new objective of translation.

Abstract

As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve the checkability of model outputs, but they degrade accuracy compared to a baseline trained only to maximize correctness, a phenomenon named legibility tax. We propose a solution that decouples the correctness condition from the checkability condition by training a "translator" model that turns a fixed solver model's solution into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to translate the solver's solution into a checkable form while retaining the solver's answer. To accommodate this new objective of translation, we formulate a decoupled prover-verifier game where the equilibria correspond to faithful and checkable translators.

Paper Structure (18 sections, 1 theorem, 13 equations, 2 figures, 1 table)

Key Result

Theorem 1

In the verifier-leading Stackelberg game where the verifier's utility is $R_V$ and the translator's utility is $R_T$, the tuple $(v^*, \tau^*, \tau'^*)$ being an equilibrium is necessary and sufficient for the faithfulness of $\tau^*$ with respect to $s$, and for the completeness and soundness prope

Figures (2)

  • Figure 1: Round 1 and 2 of our decoupled prover-verifier game. All values are exponential moving averages with $\alpha=0.02$. Verifier score is the average logit output by the verifier, and faithfulness is the fraction of outputs from the faithful translator whose answer matches that of the solver.
  • Figure 2: Round 1 and 2 of the baseline prover-verifier game.
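The two quantities named in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the common exponential-moving-average convention in which $\alpha$ weights the newest value, initializes the average with the first observation, and compares answers by exact string match; none of these details are stated in the source, and the function names `ema` and `faithfulness` are hypothetical.

```python
from typing import Iterable, List, Sequence


def ema(values: Iterable[float], alpha: float = 0.02) -> List[float]:
    """Smooth a series with s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    Assumption: the average starts at the first observed value.
    """
    smoothed: List[float] = []
    state = None
    for x in values:
        state = x if state is None else alpha * x + (1.0 - alpha) * state
        smoothed.append(state)
    return smoothed


def faithfulness(translator_answers: Sequence[str],
                 solver_answers: Sequence[str]) -> float:
    """Fraction of translator outputs whose answer matches the solver's.

    Assumption: answers are compared by exact string equality.
    """
    if len(translator_answers) != len(solver_answers):
        raise ValueError("answer lists must have the same length")
    if not translator_answers:
        raise ValueError("at least one answer pair is required")
    matches = sum(t == s for t, s in zip(translator_answers, solver_answers))
    return matches / len(translator_answers)


if __name__ == "__main__":
    # Hypothetical per-step verifier scores, smoothed as in the caption.
    print(ema([0.0, 1.0, 1.0, 1.0], alpha=0.02))
    # Hypothetical final answers from the translator and the solver.
    print(faithfulness(["4", "5", "7"], ["4", "6", "7"]))
```

With $\alpha=0.02$ each new value contributes only 2% of the smoothed curve, so the plotted lines respond slowly to changes in the underlying training metrics.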

Theorems & Definitions (2)

  • Theorem 1
  • proof