Benchmarking Large Language Model Uncertainty for Prompt Optimization
Pei-Fu Guo, Yun-Da Tsai, Shou-De Lin
TL;DR
This work addresses the lack of uncertainty estimation in prompt optimization for large language models. It introduces a benchmarking pipeline that builds large tree-structured reasoning traces through prompt perturbations and Monte Carlo sampling, and uses them to establish ground-truth values for four uncertainty types: Answer Uncertainty (AnsU), Correctness Uncertainty (CU), Aleatoric Uncertainty (AU), and Epistemic Uncertainty (EU). The pipeline evaluates current NLG uncertainty metrics (NPE, LNPE, Top-DISP, Intra) on GSM8K and StrategyQA using GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, and finds that most metrics align with AnsU and AU rather than CU. The results reveal a significant gap: existing metrics reflect output confidence and diversity, but they estimate correctness uncertainty poorly, and correctness uncertainty is what matters for prompt optimization tasks that aim at accurate answers. The authors provide code and data in the referenced GitHub repository and argue for optimization-objective-aware uncertainty estimators that can better guide prompt search strategies in LLMs.
Abstract
Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis of models such as GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that current metrics align more with Answer Uncertainty, which reflects output confidence and diversity, than with Correctness Uncertainty. This highlights the need for improved, optimization-objective-aware metrics that can better guide prompt optimization. Our code and dataset are available at https://github.com/0Frett/PO-Uncertainty-Benchmarking.
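To make the four uncertainty types concrete, the following is a minimal sketch of how they could be estimated from Monte Carlo samples grouped by perturbed prompt. It is an illustration under stated assumptions, not the benchmark's actual implementation: the entropy-based definitions, the entropy decomposition used for Aleatoric and Epistemic Uncertainty, and the names `entropy` and `uncertainty_summary` are all hypothetical choices made here.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution over samples."""
    counts = Counter(samples)
    total = len(samples)
    return sum((c / total) * math.log(total / c) for c in counts.values())


def uncertainty_summary(answers_by_prompt, gold):
    """Estimate four uncertainty values from sampled final answers.

    answers_by_prompt: dict mapping each perturbed prompt (hypothetical key)
        to the list of final answers sampled under that prompt.
    gold: the reference answer used to judge correctness.

    Assumed definitions (illustrative only):
      AnsU = entropy of the pooled answer distribution
      CU   = entropy of the pooled correct/incorrect distribution
      AU   = sample-weighted mean of the per-prompt answer entropies
      EU   = AnsU - AU (the part explained by disagreement across prompts)
    """
    pooled = [a for answers in answers_by_prompt.values() for a in answers]
    n = len(pooled)

    ans_u = entropy(pooled)
    cu = entropy([a == gold for a in pooled])
    au = sum(len(a) / n * entropy(a) for a in answers_by_prompt.values())
    eu = ans_u - au
    return {"AnsU": ans_u, "CU": cu, "AU": au, "EU": eu}


if __name__ == "__main__":
    # Four distinct wrong answers: diverse outputs, but no doubt about correctness.
    samples = {"prompt_a": ["20", "21"], "prompt_b": ["22", "23"]}
    print(uncertainty_summary(samples, gold="18"))
```

In this toy case every sampled answer is different and every one is wrong, so the answer distribution has maximal entropy while the correctness distribution has none. A metric that tracks answer diversity would report high uncertainty here even though the outcome, measured against the optimization objective, is certain.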
