Confidence Calibration and Rationalization for LLMs via Multi-Agent Deliberation

Ruixin Yang; Dheeraj Rajagopal; Shirley Anugrah Hayati; Bin Hu; Dongyeop Kang

TL;DR

This paper addresses the poor confidence calibration of large language models (LLMs), especially after RLHF, by introducing Collaborative Calibration, a training-free, post-hoc method that uses a two-stage, multi-agent deliberation among tool-augmented LLM agents. In Stage 1, an expert-agent ensemble generates diverse stances and uncalibrated confidences; in Stage 2, general agents debate, rationalize, and critique these stances, producing refined posteriors and a final majority decision. The approach yields improved calibration across six generative QA tasks, demonstrated by lower Expected Calibration Error (ECE) and competitive Brier scores, without sacrificing accuracy or requiring fine-tuning. This work offers a scalable, interpretable pathway to more reliable LLM predictions in high-stakes settings by leveraging collective reasoning and rationalized confidence, potentially enhancing human-AI collaboration and trust.
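The summary above reports calibration in terms of Expected Calibration Error (ECE) and Brier score. As a rough illustration of what these two metrics measure, the sketch below computes both from a list of stated confidences and binary correctness labels. It assumes equal-width confidence bins and a default of 10 bins; the function names and the bin count are illustrative choices, not the paper's evaluation code.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[int],
    n_bins: int = 10,  # assumed bin count, not taken from the paper
) -> float:
    """ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * n_bins  # sum of confidences falling in each bin
    hits = [0] * n_bins        # number of correct answers in each bin
    count = [0] * n_bins       # number of samples in each bin

    for c, y in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += y
        count[idx] += 1

    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hits[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


def brier_score(confidences: Sequence[float], correct: Sequence[int]) -> float:
    """Brier score: mean squared gap between confidence and the 0/1 outcome."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(correct)


if __name__ == "__main__":
    # Toy data: two answers stated at 0.9 (one wrong), two at 0.6 (both right).
    confs = [0.9, 0.9, 0.6, 0.6]
    labels = [1, 0, 1, 1]
    print(expected_calibration_error(confs, labels))
    print(brier_score(confs, labels))
```

On this toy input the two bins each show a 0.4 gap between accuracy and mean confidence, so the ECE works out to 0.4, and the Brier score to 0.285. Lower is better for both metrics.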

Abstract

Uncertainty estimation is a significant issue for current large language models (LLMs), which are generally poorly calibrated and over-confident, especially after reinforcement learning from human feedback (RLHF). Unlike humans, whose decisions and confidences not only stem from intrinsic beliefs but can also be adjusted through daily observations, existing calibration methods for LLMs focus on estimating or eliciting individual confidence without taking full advantage of the "Collective Wisdom": the interaction among multiple LLMs that can collectively improve both accuracy and calibration. In this work, we propose Collaborative Calibration, a post-hoc training-free calibration strategy that leverages the collaborative and expressive capabilities of multiple tool-augmented LLM agents in a simulated group deliberation process. We demonstrate the effectiveness of Collaborative Calibration on generative QA tasks across various domains, showing its potential in harnessing the rationalization of collectively calibrated confidence assessments and improving the reliability of model predictions.

Paper Structure (14 sections, 4 equations, 4 figures, 3 tables)

Introduction
Related Work
Collaborative Calibration: Calibrating Confidence via Multi-Agent Deliberation
Agent Ensemble and Stance Generation
Group Deliberation with Rationales and Feedback
Experiments and Results
Conclusion
Appendix
Details on agent selection
Details on experiment setup
Datasets
Evaluation methods
Detailed Results
Prompt templates and example output

Figures (4)

Figure 1: High-level overview of the Collaborative Calibration pipeline.
Figure 2: Detailed illustration of the two-stage framework with a specific test example from the SciQ dataset.
Figure 3: Reliability diagrams comparing vanilla verbalized confidence + Self-consistency (M=6) and our Collaborative Calibration with an ensemble of 6 agents on GSM8K, SciQ, and DateUnd.
Figure 4: Reliability diagrams comparing calibration performance before and after Stage 2 (group deliberation) on TriviaQA.
