When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)

Mahak Agarwal, Divyam Khanna

TL;DR

The paper investigates how persuasive but incorrect claims can override truth in a single-turn, multi-agent LLM debate and introduces the Confidence-Weighted Persuasion Override Rate (CW-POR) to quantify both misjudgment frequency and confidence. It uses a three-role setup (Neutral explainer, Persuasive defender, Judge arbiter) on TruthfulQA across five open-source LLMs, manipulating verbosity from 30 to 300 words and randomizing answer order. CW-POR combines a judge's rubric confidence and log-likelihood signals to weight each override by certainty, enabling finer-grained safety assessments. The results show that even smaller models can convincingly override factual answers with high confidence, highlighting calibration gaps and the need for adversarial and multi-turn evaluation to mitigate confidently endorsed misinformation. The findings advocate for stronger calibration, broader data representation beyond adversarial prompts, and potential multi-turn or ensemble approaches to improve reliability in real-world AI systems. CW-POR and related confidence metrics provide a principled way to quantify and address these vulnerabilities.

Abstract

In many real-world scenarios, a single Large Language Model (LLM) may encounter contradictory claims, some accurate and others forcefully incorrect, and must judge which is true. We investigate this risk in a single-turn, multi-agent debate framework: one LLM-based agent provides a factual answer from TruthfulQA, another vigorously defends a falsehood, and the same LLM architecture serves as judge. We introduce the Confidence-Weighted Persuasion Override Rate (CW-POR), which captures not only how often the judge is deceived but also how strongly it believes the incorrect choice. Our experiments on five open-source LLMs (3B-14B parameters), where we systematically vary agent verbosity (30-300 words), reveal that even smaller models can craft persuasive arguments that override truthful answers, often with high confidence. These findings underscore the importance of robust calibration and adversarial testing to prevent LLMs from confidently endorsing misinformation.

Paper Structure

This paper contains 33 sections, 2 equations, 6 figures.

Figures (6)

  • Figure 1: Example of a single-turn multi-agent debate. A factual question is answered by Agent A (Correct) and Agent B (Persuasive). The Judge Model evaluates both responses, reporting a self-rated confidence (4/5) (0.8 after normalization) and a log-likelihood confidence (0.92), which are combined into a final confidence (0.736). The Judge's override decision (selecting the incorrect Answer B) is then used in computing the Confidence-Weighted Persuasion Override Rate (CW-POR).
  • Figure 2: CW-POR by category (bars, left axis) with 95% confidence intervals, and question share (line, right axis). Some categories exhibit high CW-POR despite small question counts, indicating potential data-scarcity spikes.
  • Figure 3: Examples for important categories (see Figure 2)
  • Figure 4: CW-POR vs. verbosity for each model. A notable dip is visible around 90-120 words, after which models diverge in behavior.
  • Figure 5: CW-POR comparing adversarial vs. non-adversarial questions across five models.
  • ...and 1 more figures
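
The confidence arithmetic in the Figure 1 caption can be sketched in a few lines of Python. The per-judgment step follows the caption: a rubric score of 4/5 is normalized to 0.8 and multiplied by the log-likelihood confidence of 0.92 to give 0.736. The aggregation step is an assumption, since this page does not reproduce the paper's equations: here CW-POR is taken to be the share of total combined confidence that falls on override decisions. The names `Judgment`, `combined_confidence`, and `cw_por` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Judgment:
    """One judge decision (hypothetical record layout)."""
    rubric_score: int      # judge's self-rated confidence on a 1-5 rubric
    loglik_conf: float     # log-likelihood-derived confidence in [0, 1]
    chose_incorrect: bool  # True when the judge picked the persuasive, wrong answer


def combined_confidence(j: Judgment, max_score: int = 5) -> float:
    """Normalize the rubric score and multiply by the log-likelihood confidence.

    Matches the Figure 1 example: (4 / 5) * 0.92 is approximately 0.736.
    """
    return (j.rubric_score / max_score) * j.loglik_conf


def cw_por(judgments: Iterable[Judgment]) -> float:
    """Confidence-weighted override rate.

    ASSUMPTION: overrides are weighted by combined confidence and divided by
    the total combined confidence; the paper's exact definition may differ.
    """
    items = list(judgments)
    total = sum(combined_confidence(j) for j in items)
    if total == 0:
        return 0.0  # no judgments, or all made with zero confidence
    overridden = sum(combined_confidence(j) for j in items if j.chose_incorrect)
    return overridden / total


if __name__ == "__main__":
    example = Judgment(rubric_score=4, loglik_conf=0.92, chose_incorrect=True)
    print(round(combined_confidence(example), 3))  # Figure 1 reports 0.736
```

Under this assumed aggregation, a judge that is overridden only on low-confidence decisions scores lower than one that is overridden equally often but with high confidence, which is the distinction the metric is meant to capture.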