Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models

Kevin Zhou; Adam Dejl; Gabriel Freedman; Lihu Chen; Antonio Rago; Francesca Toni

TL;DR

This study examines uncertainty quantification (UQ) for argumentative LLMs (ArgLLMs) in claim verification. It evaluates four UQ methods (Direct Prompting, Semantic Entropy, Eccentricity, and LUQ) within an argumentative framework built on quantitative bipolar argumentation frameworks (QBAFs), which uses DF-QuAD gradual semantics to decide truth. ArgLLMs compute base scores from UQ outputs, combine supporting and attacking arguments, and predict that a claim is true when the final DF-QuAD score exceeds $0.5$, enabling downstream calibration without ground-truth argument labels. Across three datasets (TruthfulClaim, StrategyClaim, MedClaim) and a 36-configuration design, direct prompting consistently achieves the highest accuracy, while LUQ is competitive in some configurations at a higher computational cost. The results underscore the value of prompt-based UQ in long, contentious argumentative generation and position ArgLLMs as a robust benchmark for evaluating LLM UQ methods in explainable decision-making tasks.
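To make the decision rule concrete, the following is a minimal sketch of DF-QuAD evaluation over a tree-shaped QBAF, using the standard DF-QuAD definitions (probabilistic-sum aggregation of attackers and supporters, followed by a shift of the base score towards 0 or 1). The `Argument` class, the function names, and the numeric base scores are illustrative assumptions and are not taken from the paper; in an ArgLLM the base scores of arguments would come from the chosen UQ method.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List


@dataclass
class Argument:
    """Hypothetical node of a tree-shaped QBAF (names are illustrative)."""
    base_score: float                                   # in [0, 1], e.g. a UQ confidence
    supporters: List["Argument"] = field(default_factory=list)
    attackers: List["Argument"] = field(default_factory=list)


def aggregate(strengths: List[float]) -> float:
    """DF-QuAD aggregation: probabilistic sum, 0 for an empty list."""
    return 1.0 - prod(1.0 - s for s in strengths)


def dfquad_strength(arg: Argument) -> float:
    """Final strength of an argument under DF-QuAD, computed recursively."""
    va = aggregate([dfquad_strength(a) for a in arg.attackers])
    vs = aggregate([dfquad_strength(s) for s in arg.supporters])
    v0 = arg.base_score
    if va >= vs:
        # Attackers dominate (or tie): move the base score towards 0.
        return v0 - v0 * (va - vs)
    # Supporters dominate: move the base score towards 1.
    return v0 + (1.0 - v0) * (vs - va)


def predict_true(claim: Argument, threshold: float = 0.5) -> bool:
    """Predict the claim is true when its final strength exceeds the threshold."""
    return dfquad_strength(claim) > threshold


# Illustrative values only: a claim with one supporter and one attacker.
claim = Argument(0.5, supporters=[Argument(0.8)], attackers=[Argument(0.6)])
print(dfquad_strength(claim), predict_true(claim))
```

With these illustrative values the supporter outweighs the attacker, so the claim's strength rises from 0.5 to roughly 0.6 and the claim is predicted true. Because the final score depends only on the base scores and the structure of the framework, swapping one UQ method for another changes the prediction solely through the base scores it assigns.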

Abstract

Research in uncertainty quantification (UQ) for large language models (LLMs) is increasingly important for ensuring the reliability of this groundbreaking technology. We explore the integration of LLM UQ methods in argumentative LLMs (ArgLLMs), an explainable LLM framework for decision-making based on computational argumentation in which UQ plays a critical role. We conduct experiments to evaluate ArgLLMs' performance on claim verification tasks when using different LLM UQ methods, thereby also assessing the effectiveness of the UQ methods themselves. Moreover, the experimental procedure itself is a novel way of evaluating the effectiveness of UQ methods, especially when intricate and potentially contentious statements are present. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.
