DebUnc: Improving Large Language Model Agent Communication With Uncertainty Metrics

Luke Yoffe, Alfonso Amayuelas, William Yang Wang

TL;DR

DebUnc tackles the problem of overconfident, incorrect LLM outputs in multi-agent debates by quantifying each agent's uncertainty and communicating it to its peers. It introduces two ways to communicate uncertainty: prompt-based confidence signaling, and a novel attention-scaling mechanism that biases token generation toward more confident agents. Empirical results across multiple LLMs and benchmarks show that attention scaling, particularly Attention-All, delivers the strongest improvements, and that these gains grow with the quality of the uncertainty metric (with an Oracle metric serving as an idealized upper bound). The work highlights a practical path to more reliable cooperative reasoning in LLM systems and provides a foundation for developing more robust uncertainty metrics.

Abstract

Multi-agent debates have been introduced to improve the accuracy of Large Language Models (LLMs) by having multiple agents discuss solutions to a problem over several rounds of debate. However, models often generate incorrect yet confident-sounding responses, which can mislead others. This issue arises partly because agents do not consider how confident their peers are. To address this, we propose DebUnc, a debate framework that uses uncertainty metrics to assess agent confidence. Confidence is then conveyed through a modified attention mechanism that adjusts token weights, or through textual prompts. Evaluations across benchmarks show that attention-based methods are particularly effective and that performance continues to improve as uncertainty estimation becomes more reliable. The code is available at https://github.com/lukeyoffe/debunc.
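
As an illustration of what an uncertainty metric can look like, the sketch below computes a mean token entropy, one of the metrics named in the figure captions. It assumes the usual reading of that name (the Shannon entropy of each next-token distribution, averaged over the generated tokens); the function names and the toy probabilities are hypothetical and are not taken from the DebUnc code.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_token_entropy(step_probs: Sequence[Sequence[float]]) -> float:
    """Average the per-token entropies over one generated response."""
    if not step_probs:
        raise ValueError("response has no tokens")
    return sum(token_entropy(p) for p in step_probs) / len(step_probs)


# Toy distributions over a 4-token vocabulary (made-up numbers).
confident = [[0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02]]
unsure = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

# A peaked distribution yields lower entropy, i.e. lower uncertainty.
print(mean_token_entropy(confident) < mean_token_entropy(unsure))
```

A lower value indicates a more confident response; how that value is then passed to other agents (in the prompt or through attention) is a separate design choice.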

Paper Structure

This paper contains 21 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Example three-agent debate. The first agent initially provides an incorrect response but corrects itself after considering the answers and confidence levels of the others. Each agent uses an LLM to generate its response and an uncertainty metric to assess its confidence. Correct answers are shown in green, while incorrect ones are shown in red.
  • Figure 2: Illustration of the modified multi-agent debate involving three agents. In the first round, each agent independently generates a response to the question, which is evaluated for confidence using an uncertainty metric. The prompt for each subsequent round includes the other agents' responses from the previous round. Sections of the prompt highlighted in green are used only with the Confidence in Prompt method. Each agent retains access to its complete chat history throughout the debate. After the final round, a majority vote determines the final answer.
  • Figure 3: Illustration of the Attention-All method from the perspective of Agent 1. As the second debate round begins, the model's context includes the initial prompt and each agent's response. Agent 2 provided a correct response with lower uncertainty than Agents 1 and 3, who responded incorrectly. Because Agent 2 has the lowest uncertainty, the attention weights for the tokens in Agent 2's response are increased, while those for the tokens in Agent 1's and Agent 3's responses are decreased. This leads Agent 1 to switch to the correct answer.
  • Figure 4: Plots showing the percent increase in accuracy over standard debate versus uncertainty metric AUROC for a given combination of benchmark, uncertainty metric, and trial using Mistral-7B. A higher AUROC indicates better metric performance. The plots are titled by uncertainty incorporation method and color-coded by the uncertainty metric used. The trendlines show that attention-based methods, especially Attention-All, lead to more substantial performance gains as AUROC increases compared to methods that incorporate confidence directly into the prompt.
  • Figure 5: Distribution of uncertainties for correct and incorrect answers across all Mistral-7B experiments, as measured by the uncertainty metrics Mean Token Entropy and TokenSAR. Generally, correct answers exhibit lower uncertainties than incorrect ones, indicating that although not perfect, uncertainty metrics are useful for distinguishing between accurate responses and those where the agent may be hallucinating.
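
The attention scaling described in Figure 3 can be sketched on a single row of attention weights. This is a minimal illustration under stated assumptions, not the authors' implementation: confidence is taken to be the inverse of uncertainty, prompt tokens are left untouched, and the scaled weights are renormalized so that the total attention mass on agent responses is unchanged. The function name `scale_attention` and the toy numbers are hypothetical.

```python
from typing import Dict, List, Optional


def scale_attention(
    weights: List[float],
    owners: List[Optional[str]],
    uncertainty: Dict[str, float],
) -> List[float]:
    """Rescale one row of post-softmax attention weights by agent confidence.

    owners[i] names the agent whose response contains token i, or None for
    tokens (such as the initial prompt) that are not rescaled.
    Assumption: confidence = 1 / uncertainty, with every uncertainty > 0.
    """
    scaled = list(weights)
    agent_idx = [i for i, owner in enumerate(owners) if owner is not None]
    for i in agent_idx:
        scaled[i] = weights[i] / uncertainty[owners[i]]

    # Assumption: keep the total mass on agent tokens the same as before.
    before = sum(weights[i] for i in agent_idx)
    after = sum(scaled[i] for i in agent_idx)
    if after > 0.0:
        for i in agent_idx:
            scaled[i] *= before / after
    return scaled


# One prompt token followed by one token from each of three agents.
weights = [0.4, 0.2, 0.2, 0.2]
owners = [None, "agent1", "agent2", "agent3"]
# Agent 2 is the least uncertain, as in the Figure 3 scenario.
uncertainty = {"agent1": 2.0, "agent2": 0.5, "agent3": 2.0}

out = scale_attention(weights, owners, uncertainty)
print(out)
```

In this toy case the weight on Agent 2's token rises while the weights on the tokens from Agents 1 and 3 fall, and the row still sums to its original total. In a real model the same rescaling would have to be applied per head and per layer, which this sketch does not attempt.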