Training Language Models to Win Debates with Self-Play Improves Judge Accuracy

Samuel Arnesen; David Rein; Julian Michael

TL;DR

In quantitative and qualitative comparisons between debate models and novel consultancy baselines, the authors find evidence that debate training encourages stronger and more informative arguments, suggesting that it can help provide high-quality supervision for tasks that are difficult to evaluate directly.

Abstract

We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that language-model-based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to directly evaluate.

Paper Structure (65 sections, 5 equations, 12 figures)

Introduction
Experimental Setup
Task Design
Debate Protocol
Baselines
Evaluation
Training Methods
Judge
Debaters and Consultants
Supervised Training
Self-Play DPO Training
Training Objective
Reward Function
Sampling Method
Training Procedure
...and 50 more sections

Figures (12)

Figure 1: Evaluation protocols. We use a simultaneous debate format where the debaters can only see speeches delivered by their opponent from previous turns. Consultancy differs from debate in that the debaters can never see arguments generated by an opponent.
Figure 2: Example transcript. This is an abbreviated transcript of a debate between two copies of a fully trained debate model. It concerns a short story that the debaters can read but the judge cannot. Verified quotes from the underlying text are written in red. See the example transcripts appendix for complete transcripts.
Figure 3: Judge training. Our judge is a finetuned version of GPT-4-Turbo. The resulting model is more accurate and better calibrated on the validation set for both debate and consultancy.
Figure 4: Debate and consultant training. We train Llama3-8B to convince the judge in both the debate and consultancy mediums using SFT and DPO. Depicted are win rates over the final iteration of DPO training, initialized from the SFT model. Overall win rates for each debate checkpoint (left) are calculated on the basis of Elo scores inferred from head-to-head win rates (right).
Figure 5: Skill-Accuracy Relationship. The judge's accuracy increases alongside the skill level of the debaters. For consultancy, this relationship is indistinguishable from noise.
...and 7 more figures
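
The Figure 4 caption says that overall win rates are calculated from Elo scores inferred from head-to-head win rates. The fitting procedure is not given in this summary, so the following is only a minimal sketch of one common approach: a Bradley-Terry fit by gradient ascent, rescaled to Elo units. The function names (`expected_win_rate`, `fit_elo`, `overall_win_rate`) are hypothetical, and defining the overall win rate as the mean expected win rate against the other checkpoints is an assumption, not the paper's stated method.

```python
import math


def expected_win_rate(elo_a, elo_b):
    """Standard Elo expectation: probability that A beats B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def fit_elo(win_rate, iters=2000, lr=0.1):
    """Fit Elo scores to a matrix of head-to-head win rates.

    win_rate[i][j] is the observed fraction of games that checkpoint i
    wins against checkpoint j (the diagonal is ignored). Assumes every
    pair played equally often; this is an illustrative simplification.
    """
    n = len(win_rate)
    theta = [0.0] * n  # Bradley-Terry log-strengths
    for _ in range(iters):
        grad = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i != j:
                    p = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
                    grad[i] += win_rate[i][j] - p  # observed minus predicted
        theta = [t + lr * g for t, g in zip(theta, grad)]
        mean = sum(theta) / n
        theta = [t - mean for t in theta]  # scores are only defined up to a shift
    scale = 400.0 / math.log(10)  # convert natural-log strengths to Elo points
    return [t * scale for t in theta]


def overall_win_rate(elos):
    """Mean expected win rate of each checkpoint against all others (assumed definition)."""
    n = len(elos)
    return [
        sum(expected_win_rate(elos[i], elos[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
```

Given a square matrix of pairwise win rates between checkpoints, `overall_win_rate(fit_elo(matrix))` yields one summary number per checkpoint. Going through Elo scores rather than averaging raw win rates has the advantage that the summary stays comparable even when some pairings are noisier than others.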
