DiscoUQ: Structured Disagreement Analysis for Uncertainty Quantification in LLM Agent Ensembles

Bo Jiang

Abstract

Multi-agent LLM systems, where multiple prompted instances of a language model independently answer questions, are increasingly used for complex reasoning tasks. However, existing methods for quantifying the uncertainty of their collective outputs rely on shallow voting statistics that discard the rich semantic information in agents' reasoning. We introduce DiscoUQ, a framework that extracts and leverages the structure of inter-agent disagreement -- both linguistic properties (evidence overlap, argument strength, divergence depth) and embedding geometry (cluster distances, dispersion, cohesion) -- to produce well-calibrated confidence estimates. We propose three methods of increasing complexity: DiscoUQ-LLM (logistic regression on LLM-extracted structure features), DiscoUQ-Embed (logistic regression on embedding geometry), and DiscoUQ-Learn (a neural network combining all features). Evaluated on four diverse benchmarks (StrategyQA, MMLU, TruthfulQA, ARC-Challenge) with a 5-agent system using Qwen3.5-27B, DiscoUQ-LLM achieves an average AUROC of 0.802, outperforming the best baseline (LLM Aggregator, 0.791) while being substantially better calibrated (ECE 0.036 vs. 0.098). The learned features generalize across benchmarks with near-zero performance degradation and provide the largest improvements where they are most needed: in the ambiguous "weak disagreement" tier where simple vote counting fails.
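To make the calibration numbers above concrete, the following is a minimal sketch of the vote-counting confidence that the abstract contrasts against, together with a standard equal-width binned expected calibration error (ECE). The function names, the 10-bin default, and the toy inputs are assumptions for illustration; the paper's exact baseline and binning scheme are not specified in this excerpt.

```python
from collections import Counter


def vote_confidence(answers):
    """Return (majority answer, fraction of agents that gave it).

    Hypothetical helper: a plain vote-share confidence, not the paper's method.
    """
    counts = Counter(answers)
    answer, n_votes = counts.most_common(1)[0]
    return answer, n_votes / len(answers)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE.

    ECE = sum over bins of (bin size / total) * |bin accuracy - bin mean confidence|.
    `confidences` are floats in [0, 1]; `correct` are 0/1 outcomes.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Toy usage with a 5-agent ensemble (made-up answers, not data from the paper).
majority, conf = vote_confidence(["A", "A", "B", "A", "C"])
toy_ece = expected_calibration_error([0.6, 0.6, 1.0, 1.0], [1, 0, 1, 1])
```

With five agents, vote share can only take a handful of distinct values, which is one reason a purely vote-based confidence struggles to separate correct from incorrect answers within a single disagreement tier.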

Paper Structure (46 sections, 5 figures, 6 tables)

Introduction
Related Work
Uncertainty quantification in LLMs.
Multi-agent debate and aggregation.
Ensemble methods and calibration.
Selective prediction.
Method
Multi-Agent System
Disagreement Structure Features
Linguistic Structure Features
Embedding Geometry Features
Uncertainty Quantification Methods
M1: DiscoUQ-LLM.
M2: DiscoUQ-Embed.
M3: DiscoUQ-Learn.
...and 31 more sections

Figures (5)

Figure 1: AUROC by disagreement tier across benchmarks. DiscoUQ methods show the largest improvements in the strong and weak tiers where vote counting provides the least information.
Figure 2: Expected calibration error (ECE) across benchmarks. Lower is better. All DiscoUQ methods achieve substantially lower ECE than baselines.
Figure 3: Accuracy-coverage curves across benchmarks. DiscoUQ methods maintain higher accuracy at all coverage levels, particularly on StrategyQA and TruthfulQA.
Figure 4: Cost (extra LLM calls) vs. performance (average AUROC) for all methods. M1 offers the best cost-performance tradeoff.
Figure 5: Feature importance from ablation studies. Majority confidence language and reasoning complexity are the most important LLM structure features.
