On scalable oversight with weak LLMs judging strong LLMs

Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, Jannis Bulian, Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah

TL;DR

This work investigates scalable oversight by evaluating debate and consultancy protocols in which weak LLM judges supervise stronger LLMs across extractive QA tasks with information asymmetry, closed QA tasks, and multimodal reasoning tasks. In a large-scale, inference-only study using multiple models (including Gemma7B, GPT-3.5, and Gemini Pro variants) and both assigned-role and open-role protocols, the authors measure judge accuracy and debater persuasiveness (via Elo), and run ablations over the number of turns, best-of-N, few-shot prompting, chain-of-thought, and turn order. Debate outperforms consultancy across all tasks when the consultant is randomly assigned a side. Compared with direct QA, the results are task-dependent: on extractive tasks with information asymmetry, debate outperforms QA without article, although QA with article can be the strongest baseline; on tasks without information asymmetry, the comparison is mixed. Letting agents choose which answer to argue for (open consultancy and open debate) changes the signal each protocol provides: judges are less often convinced by the wrong answer in open debate than in open consultancy. Stronger debaters tend to yield higher judge accuracy, though only modestly, suggesting weakly positive evidence for debate as a scalable oversight mechanism. Limitations and avenues for future training-oriented research include fine-tuning models to judge debates and human-in-the-loop evaluations.

Abstract

Scalable oversight protocols aim to enable humans to accurately supervise superhuman AI. In this paper we study debate, where two AIs compete to convince a judge; consultancy, where a single AI tries to convince a judge that asks questions; and compare to a baseline of direct question-answering, where the judge just answers outright without the AI. We use large language models (LLMs) as both AI agents and as stand-ins for human judges, taking the judge models to be weaker than agent models. We benchmark on a diverse range of asymmetries between judges and agents, extending previous work on a single extractive QA task with information asymmetry to also include mathematics, coding, logic and multimodal reasoning asymmetries. We find that debate outperforms consultancy across all tasks when the consultant is randomly assigned to argue for the correct/incorrect answer. Comparing debate to direct question answering, the results depend on the type of task: in extractive QA tasks with information asymmetry, debate outperforms direct question answering, but in other tasks without information asymmetry the results are mixed. Previous work assigned debaters/consultants an answer to argue for. When we allow them to instead choose which answer to argue for, we find judges are less frequently convinced by the wrong answer in debate than in consultancy. Further, we find that stronger debater models increase judge accuracy, though more modestly than in previous studies.

Paper Structure (74 sections, 2 equations, 30 figures, 1 table)

Figures (30)

  • Figure 1: Our setup. We evaluate on three types of task (top row). Extractive, where there is a question, two answer options and a source article to extract from, and information-asymmetry, meaning that judges don't get to see the article. Closed, where there is just a question and two answer options. Multimodal, where the questions involve both text and images, and two answer options. We consider six protocols (middle and bottom rows): Consultancy, where a single AI is assigned the correct/incorrect answer (with probability 50/50) and tries to convince a judge that asks questions; Open consultancy, which is similar except the AI chooses which answer to argue for. Debate, where two AIs compete to convince a judge. Open debate, which is identical except one debater, marked the protagonist, chooses which answer to argue for. QA without article, where the judge directly answers the question. QA with article, (only on extractive tasks) where the judge directly answers the question given the article.
  • Figure 2: Assigned-role results: mean judge accuracy (y-axis) split by task type (facet), judge model (x-axis), protocol (colour). Higher is better. 95% CIs are calculated by aggregating over tasks of the same type (see the appendix on error bars for details). The QA with article protocol (purple) can only be applied to extractive tasks. Only Pro 1.5 is multimodal.
  • Figure 3: Open debate, where a protagonist debater chooses which answer to argue for, and open consultancy, where the consultant chooses which answer to argue for. Top: Judge accuracy (y-axis) and win rate of protagonist/consultant (x-axis). Blue colours indicate open consultancy, red colours indicate open debate, with the shade corresponding to judge model. Each facet is task type. Bottom: Judge accuracy according to whether the protagonist/consultant chose the correct (dark) or incorrect (light) answer. Split by judge model (x-axis) and protocol: open consultancy and open debate. Each facet is task type. 95% CIs in all plots. The consultant/protagonist chooses the correct answer in 88%, 84%, 71% of questions, for extractive, closed and multimodal tasks respectively.
  • Figure 4: Top: Elo of debaters, coloured by model, separated by whether they're assigned to argue for the correct (dark) or incorrect (light) answer. Middle: Correct answer advantage (correct debater's Elo - incorrect debater's Elo) vs. aggregate debater Elo. Bottom: Plot of Pro 1.5 judge accuracy vs. debater aggregate Elo scores. 95% CIs. Judge decisions are more sensitive to debaters on extractive tasks than closed tasks.
  • Figure A.1: The statistical significance of differences between protocols. Permutation-based $p$-values of accuracy differences between protocols, for each task type and judge. To increase statistical power, we use the paired permutation test because judges and protocols are evaluated on the same set of data examples. The $p$-values are obtained using Scipy's permutation_test with permutation_type='samples' and 10,000 resamples. Note that the minimum $p$-value is limited by the number of samples.
  • ...and 25 more figures
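
The paired permutation test described in the Figure A.1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes per-example judge correctness is stored as 0/1 arrays aligned on the same examples, and the helper names (`accuracy_difference`, `paired_permutation_pvalue`) and the synthetic data are hypothetical. The only details taken from the caption are the use of SciPy's `permutation_test` with `permutation_type='samples'` and 10,000 resamples.

```python
import numpy as np
from scipy.stats import permutation_test


def accuracy_difference(correct_a, correct_b, axis=0):
    """Difference in mean accuracy between two protocols (hypothetical helper)."""
    return np.mean(correct_a, axis=axis) - np.mean(correct_b, axis=axis)


def paired_permutation_pvalue(correct_a, correct_b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation test on per-example correctness.

    correct_a, correct_b: 0/1 outcomes for two protocols judged on the SAME
    examples, in the same order (an assumption of this sketch).
    Returns (observed accuracy difference, p-value).
    """
    correct_a = np.asarray(correct_a, dtype=float)
    correct_b = np.asarray(correct_b, dtype=float)
    if correct_a.shape != correct_b.shape:
        raise ValueError("paired test needs one outcome per example per protocol")

    # permutation_type='samples' randomly swaps the two outcomes within each
    # example, which is the null hypothesis that protocol labels are exchangeable.
    result = permutation_test(
        (correct_a, correct_b),
        accuracy_difference,
        permutation_type="samples",
        n_resamples=n_resamples,
        vectorized=True,
        alternative="two-sided",
        random_state=seed,
    )
    return float(result.statistic), float(result.pvalue)


# Synthetic stand-in data (NOT results from the paper): 200 shared examples.
rng = np.random.default_rng(0)
debate_correct = rng.random(200) < 0.8
consultancy_correct = rng.random(200) < 0.6
diff, pvalue = paired_permutation_pvalue(debate_correct, consultancy_correct)
print(f"accuracy difference = {diff:.3f}, p = {pvalue:.4f}")
```

Because the null distribution is estimated from a finite number of random resamples, the smallest p-value this procedure can report is bounded away from zero, which is one way to read the caption's note about the minimum $p$-value.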