On the Importance of Uncertainty in Decision-Making with Large Language Models

Nicolò Felicioni, Lucas Maystre, Sina Ghiassian, Kamil Ciosek

TL;DR

The paper frames decision-making with natural language contexts as a batch contextual bandit problem and assesses the impact of epistemic uncertainty when large language models serve as reward predictors. It adapts several scalable uncertainty estimation techniques to LLM-based bandits: Dropout, Laplace Approximation (LA; including recursive Hessian, diagonal, and Fisher variants), Last-Layer LA, and Epinets. Each is used within a Thompson Sampling (TS) policy and compared against a greedy baseline. Across four real-world text datasets (toxic content, hate content, IMDb sentiment, and offensive language), TS-based policies consistently achieve lower final regret than the greedy policy, with Last-Layer LA often delivering the strongest performance and Dropout TS providing strong results with minimal overhead; the results carry over to GPT2-XL. The findings indicate that uncertainty should play a central role in LLM-driven decision-making, and they point to future work on extending these methods to text-based reinforcement learning and on broader scaling analyses.

Abstract

We investigate the role of uncertainty in decision-making problems with natural language as input. For such tasks, using Large Language Models as agents has become the norm. However, none of the recent approaches employ any additional phase for estimating the uncertainty the agent has about the world during the decision-making task. We focus on a fundamental decision-making framework with natural language as input, namely contextual bandits in which the context information consists of text. As a representative of the approaches with no uncertainty estimation, we consider an LLM bandit with a greedy policy, which picks the action corresponding to the largest predicted reward. We compare this baseline to LLM bandits that make active use of uncertainty estimation by integrating the uncertainty in a Thompson Sampling policy. We employ different techniques for uncertainty estimation, such as Laplace Approximation, Dropout, and Epinets. We empirically show on real-world data that the greedy policy performs worse than the Thompson Sampling policies. These findings suggest that, while overlooked in the LLM literature, uncertainty plays a fundamental role in bandit tasks with LLMs.
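
To make the contrast between the two policies concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the reward model's uncertainty can be summarized as an independent Gaussian over each action's predicted reward; in the paper, the uncertainty instead comes from techniques such as Laplace Approximation, Dropout, or Epinets applied to an LLM. The function names (`greedy_action`, `thompson_action`) and all numeric values are hypothetical and chosen only for illustration.

```python
import random


def greedy_action(mean_rewards):
    """Pick the action with the largest predicted reward, ignoring uncertainty."""
    return max(range(len(mean_rewards)), key=lambda a: mean_rewards[a])


def thompson_action(mean_rewards, std_rewards, rng):
    """Thompson Sampling step under an assumed Gaussian reward posterior.

    Draw one plausible reward per action, then act greedily on the draw.
    """
    sampled = [rng.gauss(m, s) for m, s in zip(mean_rewards, std_rewards)]
    return max(range(len(sampled)), key=lambda a: sampled[a])


# Hypothetical predictions for a single text context with two actions.
# Action 0 has a slightly higher mean; action 1 is far more uncertain.
means = [0.55, 0.50]
stds = [0.01, 0.30]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

greedy_choice = greedy_action(means)
ts_choices = [thompson_action(means, stds, rng) for _ in range(1000)]

# Fraction of rounds in which Thompson Sampling tries the uncertain action.
explore_fraction = ts_choices.count(1) / len(ts_choices)

print("greedy picks action", greedy_choice)
print("TS picks action 1 in", explore_fraction, "of rounds")
```

In this sketch the greedy policy always selects action 0, so it never gathers the data needed to learn whether action 1 is in fact better. The Thompson Sampling policy selects the uncertain action in a sizeable share of rounds, and that share shrinks as the assumed standard deviation shrinks.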

Paper Structure

This paper contains 45 sections, 28 equations, 12 figures, 1 table, and 6 algorithms.

Figures (12)

  • Figure 1: Average regret obtained on toxic content detection bandit task.
  • Figure 2: Average regret ($\pm$ std. err.) obtained on the toxic bandit task.
  • Figure 3: Average regret ($\pm$ std. err.) obtained on the imdb bandit task.
  • Figure 4: Average regret ($\pm$ std. err.) obtained on the offensive bandit task.
  • Figure 5: Average regret ($\pm$ std. err.) obtained on the hate bandit task.
  • ...and 7 more figures