AI Debate Aids Assessment of Controversial Claims

Salman Rahman, Sheriff Issaka, Ashima Suvarna, Genglin Liu, James Shiffer, Jaeyoung Lee, Md Rizwan Parvez, Hamid Palangi, Shi Feng, Nanyun Peng, Yejin Choi, Julian Michael, Liwei Jiang, Saadia Gabriel

TL;DR

This work investigates scalable oversight for frontier AI by comparing AI debate with consultancy in guiding humans with varying prior beliefs and by evaluating persona-based LLM judges as supervisory agents. Across COVID-19 and climate-change factuality tasks, AI debate improves final judgment accuracy and calibration, with the strongest gains for mainstream-belief judges and evidence of generalization to climate data. Persona-conditioned LLM judges outperform humans and non-personalized models, achieving notably higher accuracy in debate settings. The findings support deploying adversarial AI debate, paired with persona-aware automated judges, as a scalable and bias-resilient approach to supervising contentious factual assessments in high-stakes domains.

Abstract

As AI grows more powerful, it will increasingly shape how we understand the world. But with this influence comes the risk of amplifying misinformation and deepening social divides, especially on consequential topics where factual accuracy directly impacts well-being. Scalable Oversight aims to ensure AI systems remain truthful even when their capabilities exceed those of their evaluators. Yet when humans serve as evaluators, their own beliefs and biases can impair judgment. We study whether AI debate can guide biased judges toward the truth by having two AI systems debate opposing sides of controversial factuality claims on COVID-19 and climate change, topics on which people hold strong prior beliefs. We conduct two studies. Study I recruits human judges with either mainstream or skeptical beliefs who evaluate claims through two protocols: debate (interaction with two AI advisors arguing opposing sides) or consultancy (interaction with a single AI advisor). Study II uses AI judges with and without human-like personas to evaluate the same protocols. In Study I, debate consistently improves human judgment accuracy and confidence calibration, outperforming consultancy by 4-10% across COVID-19 and climate change claims. The improvement is largest for judges with mainstream beliefs (up to +15.2% accuracy on COVID-19 claims), though debate also helps skeptical judges who initially misjudge claims move toward accurate views (+4.7% accuracy). In Study II, AI judges with human-like personas achieve even higher accuracy (78.5%) than human judges (70.1%) and default AI judges without personas (69.8%), suggesting their potential for supervising frontier AI models. These findings highlight AI debate as a promising path toward scalable, bias-resilient oversight in contested domains.

Paper Structure

This paper contains 45 sections, 3 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Human judge accuracy before and after debate versus consultancy interventions across COVID-19 and climate change domains. Each panel shows results for both domains side-by-side. Debate consistently outperforms consultancy: COVID-19 shows a +10.0% overall advantage ($p<0.01$), with the largest gains for mainstream believers (+15.2%, $p<0.01$) versus skeptical believers (+4.7%, $p \nless 0.01$); climate change shows a +3.8% overall advantage even when consultants argue for their preferred position (correct 92.5% of the time) rather than randomly assigned positions (correct 50% of the time in COVID-19). Error bars show standard error.
  • Figure 2: Overview of experimental design for evaluating human supervision of AI systems on factuality claims. The flowchart depicts: (1) Initial Survey: Prolific screening and belief assessment of 1,650 participants (650 for COVID-19, 1,000 for climate change), who are categorized into skeptical and mainstream belief groups, along with collection of demographic information; (2) Protocols: judges then evaluate claims through either Consultancy (a single AI advisor arguing a randomly assigned position, correct 50% of the time) or Debate (opposing arguments from two AI advisors).
  • Figure 3: Percentage of debate/consultancy sessions for COVID-19 claims where judges transitioned between truthful ($\checkmark$) and non-truthful ($\times$) answers for (a) skeptical and (b) mainstream priors; (c)-(d) show confidence changes for each transition type for skeptical and mainstream priors. Error bars show standard error.
  • Figure 4: Impact of initial confidence on protocol effectiveness for COVID-19 claims. (a) Final accuracy rates comparing Debate vs. Consultancy protocols across different initial confidence levels. (b) Harmful update rates, showing the proportion of initially correct answers that became incorrect after intervention. (c) Beneficial update rates, showing the proportion of initially incorrect answers that became correct after intervention. Prior strength categories represent the judge's initial confidence in their answer: Low (0-40%), Moderate (40-70%), and Strong (70-100%). Error bars show standard error.
  • Figure 5: Calibration plot for debate vs. consultancy protocols for human judges evaluating COVID-19 claims.
  • ...and 12 more figures
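
The harmful and beneficial update rates described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' analysis code: the `Session` record, the function names, and the toy data are hypothetical, and because the caption's confidence ranges share their endpoints, the sketch assumes half-open bins (a confidence of exactly 40% counts as Moderate, and exactly 70% as Strong).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    """One judging session (hypothetical record layout, not from the paper)."""
    initial_correct: bool      # was the judge's answer correct before the intervention?
    final_correct: bool        # was it correct after debate or consultancy?
    initial_confidence: float  # confidence in the initial answer, in percent (0-100)


def prior_strength(confidence: float) -> str:
    """Bin initial confidence into the caption's categories.

    Assumption: bins are half-open, i.e. Low [0, 40), Moderate [40, 70),
    Strong [70, 100]. The caption does not say how endpoints are assigned.
    """
    if confidence < 40:
        return "Low"
    if confidence < 70:
        return "Moderate"
    return "Strong"


def update_rates(sessions: list[Session]) -> dict[str, Optional[float]]:
    """Compute final accuracy plus harmful and beneficial update rates."""
    initially_correct = [s for s in sessions if s.initial_correct]
    initially_incorrect = [s for s in sessions if not s.initial_correct]

    # Harmful update: initially correct answer became incorrect.
    harmful = sum(1 for s in initially_correct if not s.final_correct)
    # Beneficial update: initially incorrect answer became correct.
    beneficial = sum(1 for s in initially_incorrect if s.final_correct)

    def ratio(count: int, total: int) -> Optional[float]:
        # Return None instead of dividing by zero when a group is empty.
        return count / total if total else None

    return {
        "final_accuracy": ratio(sum(1 for s in sessions if s.final_correct), len(sessions)),
        "harmful_rate": ratio(harmful, len(initially_correct)),
        "beneficial_rate": ratio(beneficial, len(initially_incorrect)),
    }


# Toy data for illustration only; these are not results from the study.
toy_sessions = [
    Session(True, True, 80.0),
    Session(True, False, 30.0),
    Session(False, True, 50.0),
    Session(False, False, 90.0),
    Session(False, True, 65.0),
]
rates = update_rates(toy_sessions)
```

To reproduce a per-category breakdown like the one in the figure, the sessions could first be grouped by `prior_strength(s.initial_confidence)` and by protocol, with `update_rates` applied to each group.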