When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)
Mahak Agarwal, Divyam Khanna
TL;DR
The paper investigates how persuasive but incorrect claims can override truth in a single-turn, multi-agent LLM debate and introduces the Confidence-Weighted Persuasion Override Rate (CW-POR) to quantify both how often the judge errs and how confident it is when it does. It uses a three-role setup (Neutral explainer, Persuasive defender, Judge arbiter) on TruthfulQA across five open-source LLMs, manipulating verbosity from 30 to 300 words and randomizing answer order. CW-POR combines a judge's rubric confidence and log-likelihood signals to weight each override by certainty, enabling finer-grained safety assessments. The results show that even smaller models can convincingly override factual answers with high confidence, highlighting calibration gaps and the need for adversarial and multi-turn evaluation to mitigate confidently endorsed misinformation. The findings advocate for stronger calibration, broader data representation beyond adversarial prompts, and potential multi-turn or ensemble approaches to improve reliability in real-world AI systems. CW-POR and related confidence metrics provide a principled way to quantify and address these vulnerabilities.
Abstract
In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims, some accurate and others forcefully incorrect, and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Our experiments on five open-source LLMs (3B-14B parameters), where we systematically vary agent verbosity (30-300 words), reveal that even smaller models can craft persuasive arguments that override truthful answers, often with high confidence. These findings underscore the importance of robust calibration and adversarial testing to prevent LLMs from confidently endorsing misinformation.
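To make the difference between a plain override rate and a confidence-weighted one concrete, the following is a minimal Python sketch. It is not the paper's definition: the abstract does not give the exact formula, so the sketch assumes each judgment carries a single confidence score in [0, 1] and that CW-POR is the sum of confidences on overridden items divided by the total number of items. The names `Judgment`, `persuasion_override_rate`, and `confidence_weighted_por` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Judgment:
    """One judge decision (hypothetical record layout, assumed for this sketch)."""
    picked_false: bool  # True if the judge chose the persuasive, incorrect answer
    confidence: float   # judge's confidence in its own choice, assumed to lie in [0, 1]


def persuasion_override_rate(judgments: Sequence[Judgment]) -> float:
    """Fraction of items on which the judge was persuaded by the false answer."""
    if not judgments:
        raise ValueError("need at least one judgment")
    overrides = sum(1 for j in judgments if j.picked_false)
    return overrides / len(judgments)


def confidence_weighted_por(judgments: Sequence[Judgment]) -> float:
    """Assumed form of a confidence-weighted rate: each override counts
    in proportion to the judge's confidence, normalized by the item count."""
    if not judgments:
        raise ValueError("need at least one judgment")
    for j in judgments:
        if not 0.0 <= j.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
    weighted = sum(j.confidence for j in judgments if j.picked_false)
    return weighted / len(judgments)


if __name__ == "__main__":
    sample = [
        Judgment(picked_false=True, confidence=0.9),
        Judgment(picked_false=False, confidence=0.8),
        Judgment(picked_false=True, confidence=0.5),
        Judgment(picked_false=False, confidence=0.6),
    ]
    print("POR:", persuasion_override_rate(sample))
    print("CW-POR (assumed form):", confidence_weighted_por(sample))
```

Under this assumed form, the weighted rate can never exceed the plain override rate, and two judges with the same override rate are separated by how confident they were when they erred. In the setup described above, the confidence score would itself be derived from the judge's rubric confidence and log-likelihood signals; how those two signals are combined is not specified here and is left out of the sketch.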
