Benchmarking Large Language Model Uncertainty for Prompt Optimization

Pei-Fu Guo; Yun-Da Tsai; Shou-De Lin

TL;DR

This work addresses the lack of uncertainty estimation in prompt optimization for large language models. It introduces a benchmarking pipeline that builds large tree-structured reasoning traces through prompt perturbation and Monte Carlo sampling, and uses them to establish ground-truth values for four uncertainty types: Answer Uncertainty (AnsU), Correctness Uncertainty (CU), Aleatoric Uncertainty (AU), and Epistemic Uncertainty (EU). Current NLG uncertainty metrics (NPE, LNPE, Top-DISP, Intra) are evaluated on GSM8K and StrategyQA using GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, and most of them align with AnsU and AU rather than CU. The results reveal a significant gap: existing metrics reflect output confidence and diversity but estimate correctness uncertainty poorly, even though correctness uncertainty is crucial for prompt optimization tasks aimed at accurate answers. The authors release code and data in the GitHub repository linked in the abstract and highlight the need for optimization-objective-aware uncertainty estimators to better guide prompt search strategies in LLMs.

Abstract

Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset for evaluating uncertainty metrics, focusing on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis of models such as GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that current metrics align more with Answer Uncertainty, which reflects output confidence and diversity, than with Correctness Uncertainty. This highlights the need for improved, optimization-objective-aware metrics to better guide prompt optimization. Our code and dataset are available at https://github.com/0Frett/PO-Uncertainty-Benchmarking.
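The distinction between Answer Uncertainty and Correctness Uncertainty can be illustrated with a minimal sketch. It assumes, for illustration only, that both quantities are Shannon entropies over Monte Carlo samples: AnsU over the distinct sampled answers, CU over whether each sample is correct. The paper's exact definitions are not reproduced in this summary, and the function names and sample data below are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution over labels."""
    counts = Counter(labels)
    total = len(labels)
    # sum of p * log(1/p) over the observed labels
    return sum((c / total) * math.log(total / c) for c in counts.values())


def answer_uncertainty(sampled_answers):
    # Assumption: AnsU is the entropy of the sampled answer distribution.
    return shannon_entropy(sampled_answers)


def correctness_uncertainty(sampled_answers, gold_answer):
    # Assumption: CU is the entropy of the correct/incorrect indicator.
    return shannon_entropy([a == gold_answer for a in sampled_answers])


# Hypothetical samples for one question whose gold answer is "18".
gold = "18"
spread_wrong = ["12", "15", "9", "20"]   # diverse answers, all incorrect
mixed = ["18", "18", "12", "12"]         # two answers, half correct

for name, samples in [("spread_wrong", spread_wrong), ("mixed", mixed)]:
    ans_u = answer_uncertainty(samples)
    cor_u = correctness_uncertainty(samples, gold)
    print(f"{name}: AnsU={ans_u:.3f} CU={cor_u:.3f}")
```

Under these assumptions, the first sample set has high answer uncertainty but zero correctness uncertainty, because every answer is wrong. A metric that tracks output diversity would therefore rank it as highly uncertain even though its correctness is fully determined, which is the kind of mismatch the benchmark is designed to expose.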

Paper Structure (12 sections, 7 equations, 6 figures, 1 algorithm)

Introduction
Different Uncertainties for Prompt Optimization
Current NLG Uncertainty Metrics
Benchmarking Pipeline
Design Concept
Detailed Workflow
Experiments
Dataset and LLMs
Results and Analysis
Conclusion
Prompt Templates
Additional Results

Figures (6)

Figure 1: Correlation Maps of uncertainty metrics and target uncertainty on different datasets and models.
Figure 2: Example prompt for question perturbation.
Figure 3: Example prompt for StrategyQA. We randomly pick 4 few-shot samples from the pool.
Figure 4: Example prompt for GSM8K. We randomly pick 4 few-shot samples from the pool.
Figure 5: Scatter plots showing the evaluation results of metrics on Llama (StrategyQA and GSM8K) and GPT-3.5-Turbo (GSM8K and StrategyQA), with each point representing a reasoning node. The plots illustrate the relationship between CU, uncertainty metrics, and response accuracy. As shown, most metrics fail to estimate CU effectively: there is no clear trend of higher metric values (x-axis) corresponding to higher correctness uncertainty (y-axis).
...and 1 more figure
