Evaluating Uncertainty Quantification Methods in Argumentative Large Language Models

Kevin Zhou, Adam Dejl, Gabriel Freedman, Lihu Chen, Antonio Rago, Francesca Toni

TL;DR

This study examines uncertainty quantification for argumentative LLMs (ArgLLMs) in claim verification by evaluating four UQ methods—Direct Prompting, Semantic Entropy, Eccentricity, and LUQ—within a QBAF-based argumentative framework that uses DF-QuAD gradual semantics to decide truth. ArgLLMs compute base scores from UQ outputs, combine supporting and attacking arguments, and predict truth when the final DF-QuAD score exceeds $0.5$, enabling downstream calibration without ground-truth argument labels. Across three datasets (TruthfulClaim, StrategyClaim, MedClaim) and a 36-configuration design, direct prompting consistently achieves the highest accuracy, with LUQ offering competitive results in some configurations, albeit at higher computational cost. The results underscore the value of prompt-based UQ in long, contentious argumentative generation and position ArgLLMs as a robust benchmark for evaluating LLM UQ methods in explainable decision-making tasks.
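The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes a depth-1 QBAF in which each supporting or attacking argument is a leaf, so its strength equals the base score returned by the UQ method, and it uses the standard DF-QuAD aggregation and combination functions. The function names (`aggregate`, `combine`, `claim_strength`, `predict_true`) are hypothetical.

```python
from math import prod
from typing import Sequence


def aggregate(strengths: Sequence[float]) -> float:
    """DF-QuAD aggregation (probabilistic sum); returns 0.0 for no arguments."""
    return 1.0 - prod(1.0 - s for s in strengths)


def combine(base: float, attack: float, support: float) -> float:
    """DF-QuAD combination of a base score with aggregated attack and support."""
    if attack >= support:
        # Attackers dominate (or tie): move the base score towards 0.
        return base - base * (attack - support)
    # Supporters dominate: move the base score towards 1.
    return base + (1.0 - base) * (support - attack)


def claim_strength(
    base: float,
    supporter_scores: Sequence[float],
    attacker_scores: Sequence[float],
) -> float:
    """Final strength of a claim whose children are leaf arguments (assumed depth 1)."""
    return combine(base, aggregate(attacker_scores), aggregate(supporter_scores))


def predict_true(
    base: float,
    supporter_scores: Sequence[float],
    attacker_scores: Sequence[float],
    threshold: float = 0.5,
) -> bool:
    """Predict the claim as true when its final strength exceeds the threshold."""
    return claim_strength(base, supporter_scores, attacker_scores) > threshold
```

For example, with a claim base score of 0.5, one supporter scored 0.8 and one attacker scored 0.3, `claim_strength(0.5, [0.8], [0.3])` gives about 0.75, so the claim is predicted true. Swapping in a different UQ method changes only the scores passed in, which is what makes the final accuracy a proxy for UQ quality.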

Abstract

Research in uncertainty quantification (UQ) for large language models (LLMs) is increasingly important towards guaranteeing the reliability of this groundbreaking technology. We explore the integration of LLM UQ methods in argumentative LLMs (ArgLLMs), an explainable LLM framework for decision-making based on computational argumentation in which UQ plays a critical role. We conduct experiments to evaluate ArgLLMs' performance on claim verification tasks when using different LLM UQ methods, inherently performing an assessment of the UQ methods' effectiveness. Moreover, the experimental procedure itself is a novel way of evaluating the effectiveness of UQ methods, especially when intricate and potentially contentious statements are present. Our results demonstrate that, despite its simplicity, direct prompting is an effective UQ strategy in ArgLLMs, outperforming considerably more complex approaches.

Paper Structure

This paper contains 23 sections, 4 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Example of argumentative LLMs, where UQ plays a crucial role in estimating confidence in the generated arguments and thus in the claim verification itself. Here, the input claim is derived from the TruthfulQA dataset (Lin et al., 2022), arguments are generated by Llama 3.1 (Grattafiori et al., 2024), and a default base score of 0.5 is used for the input claim.
  • Figure 2: The prompt used in the direct prompting method to obtain confidence scores for the generated supporting and attacking arguments (reproduced from Freedman et al., 2024).
  • Figure 3: Prompt modification for the generation of supporting and attacking arguments, with the prompt from Freedman et al. (2024) in the top box and the new prompt we use in the bottom box. The changed portion is shown in bold.