Benchmarking Large Language Model Uncertainty for Prompt Optimization

Pei-Fu Guo, Yun-Da Tsai, Shou-De Lin

TL;DR

This work addresses the lack of uncertainty estimation in prompt optimization for large language models. It introduces a benchmarking pipeline that builds large tree-structured reasoning traces through prompt perturbation and Monte Carlo sampling, and uses them to establish ground-truth values for four uncertainty types: Answer Uncertainty (AnsU), Correctness Uncertainty (CU), Aleatoric Uncertainty (AU), and Epistemic Uncertainty (EU). Existing NLG uncertainty metrics (NPE, LNPE, Top-DISP, Intra) are evaluated on GSM8K and StrategyQA using GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct; most align with AnsU and AU rather than CU. In other words, existing metrics reflect output confidence and diversity but estimate correctness uncertainty poorly, even though correctness is what prompt optimization for accurate answers depends on. The authors release code and data on GitHub and argue for optimization-objective-aware uncertainty estimators to better guide prompt search strategies in LLMs.

Abstract

Prompt optimization algorithms for Large Language Models (LLMs) excel in multi-step reasoning but still lack effective uncertainty estimation. This paper introduces a benchmark dataset to evaluate uncertainty metrics, focusing on Answer, Correctness, Aleatoric, and Epistemic Uncertainty. Through analysis of models such as GPT-3.5-Turbo and Meta-Llama-3.1-8B-Instruct, we show that current metrics align more with Answer Uncertainty, which reflects output confidence and diversity, than with Correctness Uncertainty. This highlights the need for improved, optimization-objective-aware metrics to better guide prompt optimization. Our code and dataset are available at https://github.com/0Frett/PO-Uncertainty-Benchmarking.
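
The distinction between Answer Uncertainty and Correctness Uncertainty can be illustrated with a minimal sketch. It assumes, purely for illustration, that both are taken as the Shannon entropy of Monte Carlo samples: AnsU over the sampled answers themselves, CU over whether each sampled answer matches the gold answer. The function names and toy samples below are hypothetical and are not taken from the paper or its repository; the paper's exact definitions may differ.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of the empirical distribution over labels."""
    total = len(labels)
    probs = [count / total for count in Counter(labels).values()]
    return sum(-p * math.log(p) for p in probs)


def answer_uncertainty(sampled_answers):
    """Assumed AnsU: entropy over the distinct sampled answers."""
    return entropy(sampled_answers)


def correctness_uncertainty(sampled_answers, gold_answer):
    """Assumed CU: entropy over the correct/incorrect outcome of each sample."""
    return entropy([answer == gold_answer for answer in sampled_answers])


# Toy samples for a question whose gold answer is "18" (illustrative only).
gold = "18"
scattered_wrong = ["10", "11", "12", "13"] * 2  # diverse answers, all wrong
split = ["18"] * 4 + ["12"] * 4                 # two answers, half correct

# Diverse but uniformly wrong: high AnsU, zero CU.
print(answer_uncertainty(scattered_wrong), correctness_uncertainty(scattered_wrong, gold))
# Half correct: AnsU and CU coincide because there are only two outcomes.
print(answer_uncertainty(split), correctness_uncertainty(split, gold))
```

Under these assumptions, a node whose samples are diverse but all wrong has high AnsU and zero CU, which is the kind of mismatch a metric tracking output diversity would not detect.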

Paper Structure

This paper contains 12 sections, 7 equations, 6 figures, and 1 algorithm.

Figures (6)

  • Figure 1: Correlation Maps of uncertainty metrics and target uncertainty on different datasets and models.
  • Figure 2: Example prompt for question perturbation.
  • Figure 3: Example prompt for StrategyQA. We randomly pick 4 few-shot samples from the pool.
  • Figure 4: Example prompt for GSM8K. We randomly pick 4 few-shot samples from the pool.
  • Figure 5: Scatter plots of the metric evaluation results on Llama and GPT-3.5-Turbo for StrategyQA and GSM8K, with each point representing a reasoning node. The plots illustrate the relationship between CU, uncertainty metrics, and response accuracy. Most metrics fail to estimate CU effectively: there is no clear trend of higher metric values (x-axis) corresponding to higher correctness uncertainty (y-axis).
  • ...and 1 more figure
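
The comparison summarized in Figures 1 and 5, between a metric's values and a target uncertainty across reasoning nodes, can be sketched as a rank correlation. The snippet below is a minimal illustration that assumes Spearman correlation as the measure, which the captions do not specify. The helper names and the per-node numbers are hypothetical, made-up values, not results from the paper.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up per-node values, for illustration only.
metric_values = [0.2, 0.5, 0.9, 1.3, 1.4]
target_ansu = [0.1, 0.4, 0.8, 1.0, 1.2]  # rises with the metric
target_cu = [0.6, 0.0, 0.7, 0.1, 0.3]    # no clear trend

print(spearman(metric_values, target_ansu))
print(spearman(metric_values, target_cu))
```

A correlation near 1 would indicate that the metric orders nodes the same way as the target uncertainty, while a value near 0 corresponds to the absence of a trend described in the Figure 5 caption.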