Efficient Non-Parametric Uncertainty Quantification for Black-Box Large Language Models and Decision Planning

Yao-Hung Hubert Tsai, Walter Talbott, Jian Zhang

TL;DR

The paper tackles uncertainty quantification for black-box LLMs in step-by-step decision planning to reduce hallucinations. It introduces a non-parametric, neural estimator of point-wise dependency $r(a,x)=\frac{p(a,x)}{p(a)p(x)}$ that can be evaluated in a single inference without token logits, and extends it to include action-history via $X'$. A complete pipeline is described: data collection (20k prompt-action pairs), instruction fine-tuning for a step-by-step decision-making agent, and a point-wise dependency estimator trained with conformal-prediction calibration to trigger user input when needed. Empirical results show step-by-step planning outperforms all-at-once generation in F1 score, with thresholding via conformal prediction balancing precision and recall; the approach demonstrates a cost-efficient path to leveraging proprietary LLMs for interactive, multi-turn decision making.
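To make the definition of $r(a,x)$ concrete, the sketch below evaluates it on a tiny, made-up joint distribution over prompts and actions. It illustrates the ratio only: the paper's estimator is a neural network that approximates this quantity without access to the true distribution. The probabilities and the helper name `pointwise_dependency` are assumptions for illustration.

```python
# Toy joint distribution p(a, x); the numbers are invented for illustration only.
joint = {
    ("take a bath", "turn on the bathroom light"): 0.40,
    ("take a bath", "turn on the kitchen light"): 0.10,
    ("cook dinner", "turn on the bathroom light"): 0.05,
    ("cook dinner", "turn on the kitchen light"): 0.45,
}


def pointwise_dependency(joint, x, a):
    """Return r(a, x) = p(a, x) / (p(a) * p(x)) for a discrete joint table."""
    # Marginal of the prompt x: sum the joint over all actions.
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    # Marginal of the action a: sum the joint over all prompts.
    p_a = sum(p for (_, ai), p in joint.items() if ai == a)
    return joint[(x, a)] / (p_a * p_x)


r_match = pointwise_dependency(joint, "take a bath", "turn on the bathroom light")
r_mismatch = pointwise_dependency(joint, "take a bath", "turn on the kitchen light")
print(r_match, r_mismatch)
```

A value above 1 means the action co-occurs with the prompt more often than independence would predict, and a value below 1 means it co-occurs less often.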

Abstract

Step-by-step decision planning with large language models (LLMs) is gaining attention in AI agent development. This paper focuses on decision planning with uncertainty estimation to address the hallucination problem in language models. Existing approaches are either white-box or computationally demanding, which limits the use of black-box proprietary LLMs under budget constraints. The paper's first contribution is a non-parametric uncertainty quantification method for LLMs that efficiently estimates point-wise dependencies between inputs and decisions on the fly, with a single inference and without access to token logits. This estimator informs the statistical interpretation of decision trustworthiness. The second contribution outlines a systematic design for a decision-making agent that generates actions like "turn on the bathroom light" based on user prompts such as "take a bath". Users are asked to provide preferences when more than one action has a high estimated point-wise dependency. In conclusion, our uncertainty estimation and decision-making agent design offer a cost-efficient approach for AI agent development.

Paper Structure (12 sections, 4 equations, 3 figures, 2 tables)

This paper contains 12 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: A decision-making agent design. During the data collection phase, we curate a dataset comprising 20,000 pairs associating user requests with smart home actions, recognizing the potential for multiple actions per request. In the subsequent model training stage, we conduct instruction fine-tuning. The agent, utilizing a robust language model, generates a comprehensive set of actions based on user requests and prior actions. Additionally, we train a point-wise dependency neural estimator, establishing relationships among user requests, historical actions, and the current action. Moving to the deployment stage, we integrate the decision-making agent with the neural estimator to enumerate potential actions guided by point-wise dependencies exceeding a threshold determined from calibration data. User interaction occurs when multiple actions are enumerated, prompting user selection. For a single enumerated action, the agent executes it directly, and in the absence of any enumerated actions, the agent ceases operation. We use green color to denote the smart home, blue color to denote the decision-making agent, yellow color to denote the point-wise dependency neural estimator, gradients of the blue and the yellow color to denote the combination between the agent and the neural estimator, and pink color to denote the user.
  • Figure 2: Distributions of estimated point-wise dependency between user prompt, taken actions, and current action.
  • Figure 3: Conformal prediction on calibration data. We define the non-conformity score as $50$ minus the estimated point-wise dependency. Conformal prediction takes the $80\%$ quantile of these scores, which corresponds to a point-wise dependency threshold of $1.627$. Consequently, at test time we obtain a statistical guarantee that the probability of the "true action lying in the generated actions with scores greater than $1.627$" is at least $80\%$.
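
The thresholding in Figure 3 and the deployment rule in Figure 1 can be sketched together as follows. This is a minimal illustration, not the authors' implementation: the function names, the calibration values, and the finite-sample rank `ceil((n + 1) * coverage)` (the usual split conformal correction) are assumptions, while the offset of 50 and the three outcomes (ask the user, execute, stop) follow the captions.

```python
import math


def conformal_pd_threshold(calib_pd, coverage=0.8, offset=50.0):
    """Turn calibration point-wise dependencies into a threshold on r.

    calib_pd holds the estimated point-wise dependency of the TRUE action
    for each calibration example. Non-conformity score = offset - r.
    """
    scores = sorted(offset - r for r in calib_pd)
    n = len(scores)
    # Assumed finite-sample correction from split conformal prediction.
    k = math.ceil((n + 1) * coverage)
    if k > n:
        # Too few calibration points for this coverage: keep every action.
        return float("-inf")
    # score <= q is equivalent to r >= offset - q.
    return offset - scores[k - 1]


def decide(candidates, threshold):
    """Apply the deployment rule to {action: estimated r} for one step."""
    kept = [a for a, r in candidates.items() if r >= threshold]
    if len(kept) > 1:
        return ("ask_user", kept)   # several plausible actions: ask the user
    if len(kept) == 1:
        return ("execute", kept)    # one plausible action: run it directly
    return ("stop", kept)           # nothing passes the threshold: stop


# Hypothetical calibration values and candidate scores, for illustration only.
threshold = conformal_pd_threshold([1.0, 2.0, 3.0, 4.0], coverage=0.5)
outcome = decide(
    {"turn on the bathroom light": 3.2,
     "fill the bathtub": 2.4,
     "turn on the kitchen light": 0.3},
    threshold,
)
print(threshold, outcome)
```

Working with `offset - r` rather than `r` only flips the ordering so that a low score means a well-supported action; the resulting threshold is still expressed on the point-wise dependency scale, as in the caption.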